Dear Jupyter Notebook reader,
to run the notebook, please install the requirements first, e.g. via pip install -r requirements.txt. All code has been tested on a macOS machine with 32 GB of RAM and a Python 3.7 environment. Running this notebook may take up to 60 minutes depending on your hardware configuration, so an HTML version of this notebook containing all outputs is provided as well.
# Standard libraries
import time
from dateutil.relativedelta import relativedelta
import os
from itertools import starmap, product
from IPython.core.interactiveshell import InteractiveShell
# Third-party utility libraries
import pandas as pd
import numpy as np
import holidays
import geopy.distance
from scipy import stats
# Machine learning libraries
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
import folium
import plotly.express as px
# Styling
py.init_notebook_mode()
plt.style.use('tableau-colorblind10')
InteractiveShell.ast_node_interactivity = 'all'
pd.plotting.register_matplotlib_converters()
# Importing raw data
data = pd.read_csv('datafiles/raw/OPENDATA_BOOKING_CALL_A_BIKE.csv',sep=';', dtype={'DISTANCE': 'float64'})
# Filtering the data for relevant location
data = data[data.CITY_RENTAL_ZONE == 'Kassel']
# Transforming column DATE_BOOKING to datetime column
data['DATE_BOOKING'] = pd.to_datetime(data.DATE_BOOKING)
# Setting the index of Data to datetimeindex
data.set_index(data['DATE_BOOKING'], inplace=True)
# Filtering the Dataset to relevant years
data = data.loc['2015-01-01 00:00:00':'2016-12-31 23:59:59']
# Printing the value counts for columns that seem to have no varying information
for column in ['CATEGORY_HAL_ID', 'TRAVERSE_USE', 'DISTANCE', 'CITY_RENTAL_ZONE', 'TECHNICAL_INCOME_CHANNEL',
               'RENTAL_ZONE_HAL_SRC', 'COMPUTE_EXTRA_BOOKING_FEE']:
    print(data[column].value_counts(), '\n')
# Checking if DATE_BOOKING is the same as DATE_FROM
print("DATE_BOOKING = DATE_FROM \n",(data.DATE_BOOKING == pd.to_datetime(data.DATE_FROM)).value_counts())
The above shows that the columns CATEGORY_HAL_ID, TRAVERSE_USE, DISTANCE, CITY_RENTAL_ZONE, RENTAL_ZONE_HAL_SRC, COMPUTE_EXTRA_BOOKING_FEE and DATE_BOOKING carry no information, as they contain the same value for every observation. These columns can therefore be eliminated.
The column TECHNICAL_INCOME_CHANNEL does contain varying values. However, these values cannot be interpreted, as DB provides no explanation for them. Therefore this column can also be dropped.
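This manual check can also be automated. A minimal sketch on a toy frame (illustrative column names and values, not the actual dataset) that finds and drops columns holding a single distinct value:

```python
import pandas as pd

# Toy frame standing in for the booking data
df = pd.DataFrame({'CITY_RENTAL_ZONE': ['Kassel', 'Kassel', 'Kassel'],
                   'DISTANCE': [0.0, 0.0, 0.0],
                   'BOOKING_HAL_ID': [1, 2, 3]})

# A column carries no information if it holds at most one distinct value
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
```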
| Column | Decision | Reason |
|---|---|---|
| BOOKING_HAL_ID | KEPT | used to calculate booking counts, as it uniquely identifies a booking |
| VEHICLE_HAL_ID | KEPT | used to calculate the number of vehicles needed to satisfy the demand |
| CUSTOMER_HAL_ID | KEPT | unique identifier of the customer, kept for clustering purposes |
| DATE_FROM | KEPT | equal to DATE_BOOKING, but since DATE_BOOKING is used as the index this column is needed later |
| DATE_UNTIL | KEPT | kept for duration calculation and possible graphics |
| START_RENTAL_ZONE | KEPT | kept for clustering |
| START_RENTAL_ZONE_HAL_ID | KEPT | kept for a later merge |
| END_RENTAL_ZONE | KEPT | kept for clustering |
| END_RENTAL_ZONE_HAL_ID | KEPT | kept for a later merge |
| RENTAL_ZONE_HAL_SRC | REMOVED | always 'Standort', so no information |
| TECHNICAL_INCOME_CHANNEL | REMOVED | not explained in the documentation and therefore not interpretable |
| COMPUTE_EXTRA_BOOKING_FEE | REMOVED | always 'No', so dropped |
| DATE_BOOKING | REMOVED | duplicate of DATE_FROM |
# Dropping unnecessary columns
data.drop(columns=['CATEGORY_HAL_ID', 'TRAVERSE_USE', 'DISTANCE', 'CITY_RENTAL_ZONE', 'DATE_BOOKING',
                   'TECHNICAL_INCOME_CHANNEL', 'RENTAL_ZONE_HAL_SRC', 'COMPUTE_EXTRA_BOOKING_FEE'],
          inplace=True)
# Printing a summary of data
print(data.info())
# Calculating the loss of eliminating observations with missing data
percentage_of_missing_values = round(((len(data) - len(data.dropna(axis=0))) / len(data)) * 100, 4)
print('\nLoss associated with eliminating observations with missing values: {}%'.format(percentage_of_missing_values))
The columns START_RENTAL_ZONE, END_RENTAL_ZONE and END_RENTAL_ZONE_HAL_ID contain missing values. Dropping the corresponding observations only leads to a loss of 0.0046%, which is acceptable.
# Dropping rows containing any missing value
data.dropna(axis=0, inplace=True)
# Importing data containing the geographical information
zone_data = pd.read_csv('datafiles/raw/OPENDATA_RENTAL_ZONE_CALL_A_BIKE.csv', sep=';')
# Filtering the dataset for relevant city
zone_data = zone_data[zone_data['CITY'] == 'Kassel']
# Dropping unnecessary columns
zone_data = zone_data[['RENTAL_ZONE_HAL_ID', 'LATITUDE', 'LONGITUDE']]
zone_data.head()
The columns LATITUDE and LONGITUDE are in the wrong format: to use them for distance calculation, the ',' has to be replaced with '.'. In addition, the two columns are interchanged, which was verified with Google Maps.
# Replacing ',' with '.' and swapping LATITUDE and LONGITUDE
zone_data['LATITUDE'] = zone_data['LATITUDE'].apply(lambda x: float(str(x).replace(',', '.')))
zone_data['LONGITUDE'] = zone_data['LONGITUDE'].apply(lambda x: float(str(x).replace(',', '.')))
zone_data.rename(columns={'LONGITUDE': 'LATITUDE', 'LATITUDE': 'LONGITUDE'}, inplace=True)
zone_data.head()
# Merging LATITUDE and LONGITUDE for cluster analysis
data = data.merge(zone_data, how='left', left_on='START_RENTAL_ZONE_HAL_ID', right_on='RENTAL_ZONE_HAL_ID')
# Dropping RENTAL_ZONE_HAL_ID resulting from merging rental_zone since not needed
data.drop(columns=['RENTAL_ZONE_HAL_ID'], axis=1, inplace=True)
# Renaming columns LATITUDE and LONGITUDE to START_LATITUDE and START_LONGITUDE for cluster analysis
data.rename(columns={'LATITUDE': 'START_LATITUDE', 'LONGITUDE': 'START_LONGITUDE'}, inplace=True)
# Merging LATITUDE and LONGTIDUE for cluster analysis
data = data.merge(zone_data, how='left', left_on='END_RENTAL_ZONE_HAL_ID', right_on='RENTAL_ZONE_HAL_ID')
# Dropping RENTAL_ZONE_HAL_ID resulting from merging rental_zone since not needed
data.drop(columns=['RENTAL_ZONE_HAL_ID'], axis=1, inplace=True)
# Renaming columns LATITUDE and LONGITUDE to END_LATITUDE and END_LONGITUDE for cluster analysis
data.rename(columns={'LATITUDE': 'END_LATITUDE', 'LONGITUDE': 'END_LONGITUDE'}, inplace=True)
# Converting the ID columns to int for the distance merge
data['START_RENTAL_ZONE_HAL_ID'] = data['START_RENTAL_ZONE_HAL_ID'].astype(int)
data['END_RENTAL_ZONE_HAL_ID'] = data['END_RENTAL_ZONE_HAL_ID'].astype(int)
# Extracting the individual RENTAL_ZONE_IDs
zone_ids = zone_data['RENTAL_ZONE_HAL_ID']
# Creating a list containing every possible START/END zone combination
combinations = list(product(zone_ids, zone_ids))
# Creating a DataFrame for distance calculations having individual combinations as index
comb_frame = pd.DataFrame(index=combinations)
# For loop calculating the distance of each combination. If coordinates are not present, DISTANCE is set to -1
for i in combinations:
    start_latitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[0]].LATITUDE.values
    start_longitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[0]].LONGITUDE.values
    end_latitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[1]].LATITUDE.values
    end_longitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[1]].LONGITUDE.values
    try:
        comb_frame.loc[i, 'DISTANCE'] = geopy.distance.distance((start_latitude, start_longitude),
                                                                (end_latitude, end_longitude)).km
    except ValueError:
        comb_frame.loc[i, 'DISTANCE'] = -1
# Resetting the index for the merge
comb_frame.reset_index(inplace=True)
# Converting the index column to string for the merge
comb_frame['index'] = comb_frame['index'].astype(str)
# Converting the ID columns to int
data['START_RENTAL_ZONE_HAL_ID'] = data['START_RENTAL_ZONE_HAL_ID'].astype(int)
data['END_RENTAL_ZONE_HAL_ID'] = data['END_RENTAL_ZONE_HAL_ID'].astype(int)
# Create column that can be matched with the index column of comb_frame
data['combination'] = '(' + data['START_RENTAL_ZONE_HAL_ID'].astype(str) + ', ' + data['END_RENTAL_ZONE_HAL_ID'].astype(
str) + ')'
# Merging data and comb_frame
data = data.merge(comb_frame, how='left', left_on='combination', right_on='index')
# Dropping columns that were necessary for merging
data.drop(columns=['index', 'combination'], axis=1, inplace=True)
# Deleting data that is not used anymore to free RAM
del [comb_frame, zone_data, combinations, zone_ids]
# Calculating booking duration by end - start
data['BOOKING_DURATION'] = pd.to_datetime(data.DATE_UNTIL) - pd.to_datetime(data.DATE_FROM)
# Transforming booking duration to int
data['BOOKING_DURATION'] = data.BOOKING_DURATION.apply(lambda x: x.total_seconds() / 60)
# Creating final cluster data variable and remove unrealistic bookings
cluster_data = data.copy()
cluster_data = cluster_data.loc[cluster_data['BOOKING_DURATION'] < 10000, :]
For prediction, the only useful feature of the original dataset is the number of bikes rented in a specific time frame. Although predictions could be made from the series' own history alone, additional information from other sources can enhance predictive accuracy. To merge hourly climate data and to ensure evenly spaced time steps, the data is first re-sampled to a one-hour resolution by counting the number of unique rentals in each hour. After that, the climate data and information on holidays are merged. Once the dataset contains all available information, it is re-sampled to two-, six- and twenty-four-hour resolutions, choosing an appropriate aggregation for each feature.
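The effect of this first re-sampling step can be illustrated on toy data: counting per hour also materialises hours without any rental as zeros, which is exactly what produces the evenly spaced series (a sketch with made-up bookings, not the actual dataset):

```python
import pandas as pd

# Toy bookings: three rentals in hour 0, none in hour 1, one in hour 2
bookings = pd.DataFrame({'BOOKING_HAL_ID': [101, 102, 103, 104]},
                        index=pd.to_datetime(['2015-01-01 00:05', '2015-01-01 00:40',
                                              '2015-01-01 00:55', '2015-01-01 02:10']))

# Hourly bins are created for every hour, so empty hours show up as count 0
hourly = bookings.resample('H').count()
```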
# Setting a datetimeindex for resampling
data.set_index(pd.to_datetime(data['DATE_FROM']), inplace=True)
# Dropping columns that are not needed during predictive analysis
data.drop(columns=['END_RENTAL_ZONE', 'END_RENTAL_ZONE_HAL_ID', 'START_RENTAL_ZONE', 'START_RENTAL_ZONE_HAL_ID',
'CUSTOMER_HAL_ID', 'DATE_UNTIL', 'START_LATITUDE', 'START_LONGITUDE', 'END_LATITUDE', 'DATE_FROM',
'END_LONGITUDE','VEHICLE_HAL_ID','BOOKING_DURATION','DISTANCE'], axis=1, inplace=True)
# Resampling to one hour resolution
one_h_data = data.resample('H').agg({'BOOKING_HAL_ID': np.count_nonzero})
Since the meteorological station of Kassel moved in 2013 (for reference consult Link), the climate data used in this analysis is taken from the station in Schauenburg-Elgershausen, which is only eleven kilometres from Kassel. The data was obtained from the climate data center of the DWD (Link) using station number 15207.
The following climate features have been used:
# Importing and preparing data containing information on hourly sunshine duration
sd_path = 'datafiles/weather/stundenwerte_SunshineDuration_15207_20131101_20181231_hist/produkt_sd_stunde_20131101_20181231_15207.txt'
sd_data = pd.read_csv(sd_path, sep=';')
sd_data['MESS_DATUM'] = pd.to_datetime(sd_data.MESS_DATUM, format='%Y%m%d%H')
sd_data.set_index('MESS_DATUM', inplace=True, drop=True)
sd_data.rename(columns={'SD_SO': 'SUNSHINE_DURATION'}, inplace=True)
sd_data.drop(['QN_7', 'eor', 'STATIONS_ID'], axis=1, inplace=True)
# Importing and preparing data containing information on hourly temperature and humidity
temp_path = 'datafiles/weather/stundenwerte_TemperatureHumidity_15207_20131101_20181231_hist/produkt_tu_stunde_20131101_20181231_15207.txt'
temp_data = pd.read_csv(temp_path, sep=';')
temp_data['MESS_DATUM'] = pd.to_datetime(temp_data.MESS_DATUM, format='%Y%m%d%H')
temp_data.set_index('MESS_DATUM', inplace=True, drop=True)
temp_data.rename(columns={'TT_TU': 'TWO_METER_AIR_TEMP', 'RF_TU': 'TWO_METER_HUMIDITY'}, inplace=True)
temp_data.drop(['QN_9', 'eor', 'STATIONS_ID'], axis=1, inplace=True)
# Importing and preparing data containing information on hourly precipitation and whether rain has fallen
rain_path = 'datafiles/weather/stundenwerte_Precipitation_15207_20131101_20181231_hist/produkt_rr_stunde_20131101_20181231_15207.txt'
rain_data = pd.read_csv(rain_path, sep=';')
rain_data['MESS_DATUM'] = pd.to_datetime(rain_data.MESS_DATUM, format='%Y%m%d%H')
rain_data.set_index('MESS_DATUM', inplace=True, drop=True)
rain_data.rename(columns={' R1': 'HOURLY_PRECIPITATION_HEIGHT', 'RS_IND': 'RAIN_FALLEN'}, inplace=True)
rain_data.drop(['QN_8', 'eor', 'STATIONS_ID', 'WRTR'], axis=1, inplace=True)
# Importing and preparing data containing information on hourly mean wind speed
wind_path = 'datafiles/weather/stundenwerte_Wind_15207_20131101_20181231_hist/produkt_ff_stunde_20131101_20181231_15207.txt'
wind_data = pd.read_csv(wind_path, sep=';')
wind_data['MESS_DATUM'] = pd.to_datetime(wind_data.MESS_DATUM, format='%Y%m%d%H')
wind_data.set_index('MESS_DATUM', inplace=True, drop=True)
wind_data.rename(columns={' F': 'MEAN_WIND_SPEED', ' D': 'MEAN_WIND_DIRECTION'}, inplace=True)
wind_data.drop(['QN_3', 'eor', 'STATIONS_ID', 'MEAN_WIND_DIRECTION'], axis=1, inplace=True)
# Merging climate data to main dataset
one_h_data = one_h_data.merge(sd_data, how='left', left_index=True, right_index=True)
one_h_data = one_h_data.merge(temp_data, how='left', left_index=True, right_index=True)
one_h_data = one_h_data.merge(rain_data, how='left', left_index=True, right_index=True)
one_h_data = one_h_data.merge(wind_data, how='left', left_index=True, right_index=True)
# Deleting data that is not used anymore to free RAM
del [wind_data, wind_path, rain_data, rain_path, temp_data, temp_path, sd_data, sd_path]
Now that the data is in a reasonable format and contains additional climate features, the completeness of the dataset will be examined. After that, missing values will be imputed using appropriate methods.
one_h_data.info()
From the above it can be seen that the climate data contains missing values. The feature SUNSHINE_DURATION in particular is incomplete. One reason for this is that at night, when sunshine duration is not measurable, the DWD records missing values. These missing values can therefore be logically filled with 0, since the sun does not shine at night.
The remaining features contain an equal number of missing values, all in the same time frame. Consequently, these missing values cannot be estimated using methods like regression. Therefore the built-in interpolation method 'time' is chosen, which interpolates linearly from the last non-missing observation to the next available non-missing observation.
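On a toy hourly series, the behaviour of method='time' looks as follows; the gap is filled in equal steps between the surrounding observations (a sketch with made-up values):

```python
import numpy as np
import pandas as pd

# Toy hourly temperature series with a two-hour gap
idx = pd.date_range('2015-06-01 00:00', periods=5, freq='H')
temp = pd.Series([10.0, np.nan, np.nan, 16.0, 17.0], index=idx)

# method='time' interpolates linearly along the DatetimeIndex
filled = temp.interpolate(method='time')
```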
# Replacing the code for missing values (-999) of DWD to nan
for column in one_h_data.columns[-6:]:
    one_h_data[column].replace(-999.0, np.nan, inplace=True)
# Extracting the hour of the date as an integer
one_h_data['HOUR'] = one_h_data.index.hour
# Imputing missing values of SUNSHINE_DURATION at night with zero and drop column HOUR
one_h_data.loc[
one_h_data.SUNSHINE_DURATION.isna() & ((one_h_data.HOUR <= 3) | (one_h_data.HOUR >= 23)), ['SUNSHINE_DURATION']] = 0
del one_h_data['HOUR']
# Imputing the remaining missing values using interpolation method 'time'
for column in one_h_data.columns[-6:]:
    one_h_data[column].interpolate(method='time', inplace=True)
Since rental demand can differ between working and non-working days, school holiday and bank holiday information is added to the data as a binary feature each. Information about bank holidays is obtained using the package holidays, which provides the bank holiday dates of a specific country and region. School holidays are set manually by specifying the start and end date of each school-free period.
# Importing German holidays of the province Hesse
ger_holidays = holidays.CountryHoliday('DE', prov='HE')
# Creating individual lists for year, month and day of month on the basis of the data
year_int = one_h_data.index.strftime('%Y').astype(int)
month_int = one_h_data.index.strftime('%m').astype(int)
day_int = one_h_data.index.strftime('%d').astype(int)
# Checking membership in ger_holidays requires a datetime.date object
from datetime import date

def get_holiday(dyear, dmonth, dday):
    """
    Function to check whether a certain year, month, day-of-month combination is a bank holiday
    :param dyear: year as an integer
    :param dmonth: month as an integer
    :param dday: day of the month as an integer
    :return: True if the combination falls on a bank holiday, False otherwise
    """
    is_holiday = date(dyear, dmonth, dday) in ger_holidays
    return is_holiday
# Application of get_holiday to the previously specified combinations
holidayList = list(map(get_holiday, year_int, month_int, day_int))
# Assigning the list of holidays to an individual column
one_h_data['BANK_HOLIDAY'] = holidayList
# Converting bool to int
one_h_data['BANK_HOLIDAY'] = one_h_data['BANK_HOLIDAY'].apply(lambda x: int(x))
# School holidays 2015
one_h_data.loc[
(one_h_data.index >= '2015-01-01') & (one_h_data.index <= '2015-01-11 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2015-03-30') & (one_h_data.index <= '2015-04-12 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2015-07-27') & (one_h_data.index <= '2015-09-06 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2015-10-19') & (one_h_data.index <= '2015-10-31 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2015-12-23') & (one_h_data.index <= '2015-12-31 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
# School holidays 2016
one_h_data.loc[
(one_h_data.index >= '2016-01-01') & (one_h_data.index <= '2016-01-10 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2016-03-29') & (one_h_data.index <= '2016-04-10 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2016-07-18') & (one_h_data.index <= '2016-08-28 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2016-10-17') & (one_h_data.index <= '2016-10-30 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
(one_h_data.index >= '2016-12-22') & (one_h_data.index <= '2016-12-31 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
# Filling days which are not in school holidays with 0
one_h_data.loc[one_h_data.SCHOOL_HOLIDAY.isnull(), 'SCHOOL_HOLIDAY'] = 0
# Renaming columns for later analysis
one_h_data.rename(columns={'BOOKING_HAL_ID': 'COUNT', 'TWO_METER_AIR_TEMP': 'AIR_TEMP',
'TWO_METER_HUMIDITY': 'HUMIDITY','HOURLY_PRECIPITATION_HEIGHT': 'PRECIPITATION_HEIGHT'},
inplace=True)
Now that all additional features are contained in the data, it is re-sampled to two-, six- and twenty-four-hour resolutions. The features are aggregated according to their type of recording, i.e. mean measurements are aggregated by their mean, counts and cumulative measurements are summed, and binary features stay binary encoded.
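A minimal sketch of this per-feature aggregation on toy hourly data: the count is summed, the mean measurement averaged, and the binary rain flag kept binary by taking the window maximum (toy values only, not the actual dataset):

```python
import pandas as pd

# Four toy hours of data
idx = pd.date_range('2015-06-01 00:00', periods=4, freq='H')
toy = pd.DataFrame({'COUNT': [2, 0, 3, 1],
                    'AIR_TEMP': [10.0, 12.0, 14.0, 16.0],
                    'RAIN_FALLEN': [0, 1, 0, 0]}, index=idx)

# Each feature gets the aggregation matching its type of recording
two_h = toy.resample('2H').agg({'COUNT': 'sum',
                                'AIR_TEMP': 'mean',
                                'RAIN_FALLEN': 'max'})
```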
# Resampling to two hour resolution
two_h_data = one_h_data.resample('2H').agg({'COUNT': 'sum',
                                            'SUNSHINE_DURATION': 'sum',
                                            'AIR_TEMP': 'mean',
                                            'HUMIDITY': 'mean',
                                            'PRECIPITATION_HEIGHT': 'sum',
                                            'RAIN_FALLEN': 'max',
                                            'MEAN_WIND_SPEED': 'mean',
                                            'BANK_HOLIDAY': 'max',
                                            'SCHOOL_HOLIDAY': 'max'})
# Resampling to six hour resolution
six_h_data = one_h_data.resample('6H').agg({'COUNT': 'sum',
                                            'SUNSHINE_DURATION': 'sum',
                                            'AIR_TEMP': 'mean',
                                            'HUMIDITY': 'mean',
                                            'PRECIPITATION_HEIGHT': 'sum',
                                            'RAIN_FALLEN': 'max',
                                            'MEAN_WIND_SPEED': 'mean',
                                            'BANK_HOLIDAY': 'max',
                                            'SCHOOL_HOLIDAY': 'max'})
# Resampling to 24 hour resolution
twentyfour_h_data = one_h_data.resample('1D').agg({'COUNT': 'sum',
                                                   'SUNSHINE_DURATION': 'sum',
                                                   'AIR_TEMP': 'mean',
                                                   'HUMIDITY': 'mean',
                                                   'PRECIPITATION_HEIGHT': 'sum',
                                                   'RAIN_FALLEN': 'max',
                                                   'MEAN_WIND_SPEED': 'mean',
                                                   'BANK_HOLIDAY': 'max',
                                                   'SCHOOL_HOLIDAY': 'max'})
| Column | dtype | Description |
|---|---|---|
| COUNT | int64 | The number of bikes rented in the next period. |
| SUNSHINE_DURATION | float64 | The sunshine duration in the next period in minutes. |
| AIR_TEMP | float64 | The mean temperature at two metres above ground in the next period in degrees Celsius. |
| HUMIDITY | float64 | The mean humidity at two metres above ground in the next period in percent. |
| PRECIPITATION_HEIGHT | float64 | The amount of precipitation in the next period in mm. |
| RAIN_FALLEN | float64 | Binary variable indicating whether rain has fallen in the next period. |
| MEAN_WIND_SPEED | float64 | The mean wind speed in the next period in m/s. |
| BANK_HOLIDAY | int64 | Binary variable indicating whether there is a bank holiday in the next period. |
| SCHOOL_HOLIDAY | float64 | Binary variable indicating whether there is a school holiday in the next period. |
cluster_data.head()
cluster_data.info()
For an overview of the number of started bookings per rental zone, the dataframe start_stations is created. It groups by rental zone and counts the total departure bookings over the two-year period.
# Creating a dataframe counting total departures per rental zone
start_stations = cluster_data[['START_RENTAL_ZONE', 'START_LATITUDE', 'START_LONGITUDE', 'BOOKING_HAL_ID']].groupby(['START_RENTAL_ZONE', 'START_LATITUDE', 'START_LONGITUDE'])["BOOKING_HAL_ID"].count().reset_index(name="TOTAL_DEPARTURES")
start_stations.dropna(axis=0, inplace=True)
start_stations.head()
The following map shows the start rental zones in Kassel. It is sufficient to map only the starting rental zones, because any rental zone from which no booking was made within the two-year period is unlikely to be of high relevance.
In the first iteration of the data visualization, the rental zones were mapped in the Indian Ocean. It was therefore clear that it is necessary to switch the column names of LONGITUDE and LATITUDE.
# Building map of total departures for rental zones
m = folium.Map(
location=[51.312013, 9.481551],
tiles='OpenStreetMap',
zoom_start=12
)
# Adding a marker for each rental zone
for index, row in start_stations.iterrows():
    folium.Marker([row['START_LATITUDE'], row['START_LONGITUDE']], popup=row['START_RENTAL_ZONE'],
                  tooltip='Click to see the station name').add_to(m)
m
Analogous to start_stations, an end_stations dataframe is created for an overview of the number of ended bookings per rental zone. It groups by rental zone and counts the total arrival bookings over the two-year period.
# Creating a dataframe counting total arrivals per rental zone
end_stations = cluster_data[['END_RENTAL_ZONE', 'END_LATITUDE', 'END_LONGITUDE', 'BOOKING_HAL_ID']].groupby(['END_RENTAL_ZONE', 'END_LATITUDE', 'END_LONGITUDE'])['BOOKING_HAL_ID'].count().reset_index(name='TOTAL_ARRIVALS')
end_stations.dropna(axis=0, inplace=True)
end_stations.head()
The following heatmap shows the previously aggregated values of total arrivals over the two-year period per rental station from the dataframe end_stations.
# Creating heatmap of arrivals per rental zone
fig = px.scatter_mapbox(end_stations, lat="END_LATITUDE", lon="END_LONGITUDE", hover_name="END_RENTAL_ZONE", hover_data=["TOTAL_ARRIVALS"],
color="TOTAL_ARRIVALS", size="TOTAL_ARRIVALS", color_continuous_scale=[[0, 'rgba(150, 150, 255, 0.85)'],
[1, 'rgba(0,0,255, 0.85)']], zoom=11.5, height=400)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
An analysis of the map shows that the rental zones with the most arrivals (where the most bookings ended) are close to or at university locations.
The following heatmap shows the previously aggregated values of total departures over the two-year period per rental station from the dataframe start_stations.
# Creating heatmap of departures per rental zone
fig = px.scatter_mapbox(start_stations, lat="START_LATITUDE", lon="START_LONGITUDE", hover_name="START_RENTAL_ZONE", hover_data=["TOTAL_DEPARTURES"],
color="TOTAL_DEPARTURES", size="TOTAL_DEPARTURES", color_continuous_scale=[[0, 'rgba(255, 180, 180, 0.85)'],
[1, 'rgba(255,0,0, 0.85)']], zoom=11.5, height=400)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
An analysis of the map shows that the most bikes were rented at the rental zone "Uni-Kassel / Menzelstr.". It is also noticeable that many bikes are rented in the proximity of train stations and university locations.
To show a surplus or deficiency of available bikes at the rental zones, the net departure of each rental zone is calculated by subtracting total departures from total arrivals.
# Calculation of net departure (= arrivals - departures)
net_departures = pd.merge(start_stations[['START_RENTAL_ZONE', 'TOTAL_DEPARTURES']], end_stations, left_on="START_RENTAL_ZONE", right_on="END_RENTAL_ZONE", how='left')
net_departures['NET_DEPARTURE'] = net_departures['TOTAL_ARRIVALS'] - net_departures['TOTAL_DEPARTURES']
net_departures.rename(columns={'START_RENTAL_ZONE':'RENTAL_ZONE', 'END_LONGITUDE':'LONGITUDE', 'END_LATITUDE':'LATITUDE'}, inplace=True)
net_departures = net_departures[['RENTAL_ZONE', 'LATITUDE', 'LONGITUDE', 'TOTAL_DEPARTURES', 'TOTAL_ARRIVALS', 'NET_DEPARTURE']]
net_departures.head()
The heatmap plotted below shows the calculated net departures for each rental zone. If there is a deficiency of available bikes at a given rental zone (negative net departure), the marker is colored red. Conversely, if there is a surplus of available bikes at a given rental zone (positive net departure), the marker is colored blue.
# Creating heatmap of net departures per rental zone
fig = px.scatter_mapbox(net_departures, lat="LATITUDE", lon="LONGITUDE", hover_name="RENTAL_ZONE", hover_data=["NET_DEPARTURE"],
color="NET_DEPARTURE", size=net_departures["NET_DEPARTURE"].abs(), color_continuous_scale=[[0, 'rgba(255, 0, 0, 0.85)'], [abs(min(net_departures['NET_DEPARTURE'])/(abs(min(net_departures['NET_DEPARTURE']))+max(net_departures['NET_DEPARTURE']))), 'rgba(255, 255, 255, 0.85)'], [1, 'rgba(0,0,255, 0.85)']], zoom=11.5, height=400)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
The map reveals a net shift of bikes from the train stations in the western part of the city to the university locations in the eastern part. It is also salient that the two rental zones "ICE Bahnhof / Willy-Brandt-Platz" and "Hauptbahnhof / Rainer-Dierichs-Platz" have particularly high negative net values. In contrast, the rental zones "Uni-Kassel / Menzelstr." and "Weserspitze / Weserstraße" have particularly high positive net values.
To examine whether there is a shift from one group of rental zones to another in the course of a week (e.g. higher demand in the outskirts on the weekend), an animated plot of rentals per rental zone and weekday is needed. To create an animation of bookings on a weekday basis, a column with the according weekday is needed; it is derived from the DATE_FROM column. The dataframe is then grouped by rental zone and weekday and aggregated with the number of bookings. The resulting booking counts for each rental zone are then shown on a map, animated for every weekday.
# Creating dataframe weekly_animation from copy of cluster_data and add WEEKDAY column
weekly_animation = cluster_data.copy()
weekly_animation['WEEKDAY'] = pd.to_datetime(cluster_data['DATE_FROM']).apply(lambda x: x.weekday())
weekly_animation.head()
# Grouping by START_LAT, START_LONG, NAME, WEEKDAY
weekly_animation_group = weekly_animation.groupby(['START_LATITUDE', 'START_LONGITUDE', 'START_RENTAL_ZONE', 'WEEKDAY']).count()
weekly_animation_group = weekly_animation_group.add_suffix('_COUNT').reset_index().sort_values(by='WEEKDAY')
weekly_animation_group.head()
# To cover the time aspect the following animated map reveals the rental zone load on a shifting time frame per weekday
fig = px.scatter_mapbox(weekly_animation_group, lat="START_LATITUDE", lon="START_LONGITUDE", hover_name="START_RENTAL_ZONE", size="CUSTOMER_HAL_ID_COUNT",
animation_frame="WEEKDAY", zoom=11.8, height=600)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Unfortunately, it is not possible to observe a significant change in demand patterns across the different weekdays. The visualization only shows that the highest demand occurs between Tuesday and Thursday, while demand for bikes is lower on the weekend.
To examine whether there is a shift from one group of rental zones to another in the course of a day (e.g. from the outskirts to the inner city in the morning), an animated plot of rentals per rental zone and hour of day is needed. To create an animation of bookings on an hourly basis, a column with the according hour of day is needed; it is derived from the DATE_FROM column. The dataframe is then grouped by rental zone and hour of day and aggregated with the number of bookings. The resulting booking counts for each rental zone are then shown on a map, animated for every hour of the day.
# Creating dataframe hourly-animation from copy of cluster_data and add the hour of day
hourly_animation = cluster_data.copy()
hourly_animation['HOUR'] = pd.to_datetime(cluster_data['DATE_FROM']).apply(lambda x: x.hour)
hourly_animation.head()
# Grouping by START_LAT, START_LONG, NAME, HOUR
hourly_animation_group = hourly_animation.groupby(['START_LATITUDE', 'START_LONGITUDE', 'START_RENTAL_ZONE', 'HOUR']).count()
hourly_animation_group = hourly_animation_group.add_suffix('_COUNT').reset_index().sort_values(by='HOUR')
hourly_animation_group.head()
# To cover the time aspect the following animated map reveals the rental zone load on a shifting time frame per hour of a day
fig = px.scatter_mapbox(hourly_animation_group, lat="START_LATITUDE", lon="START_LONGITUDE", hover_name="START_RENTAL_ZONE", size="CUSTOMER_HAL_ID_COUNT",
animation_frame="HOUR", zoom=11.8, height=600)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
This visualization shows peak demand between 7 am and 7 pm. The lowest demand is between 1 am and 7 am, with a minimum at 6 am. It is observable that the demand at the various rental zones rises and falls at a similar rate from hour to hour.
In addition to the pure rental-station pattern visualization, it is also possible to cluster the stations based on time and the popularity of the specific station. Time is treated on a daily basis here: an HOUR_OF_DAY column containing the hour of day from 0 to 23 is appended to the clustering dataset.
# Add HOUR_OF_DAY column to cluster_data
cluster_data['HOUR_OF_DAY'] = pd.to_datetime(cluster_data['DATE_FROM']).dt.hour
In order to identify different patterns depending on the start and end rental zones, the following code groups the dataframe by the start and end rental zone respectively, counting the number of bookings at each station for each hour of the day. In the following, the booking count represents the metric for popularity.
# Create respective dataframes for the datasets in order to cluster by HOUR_OF_DAY and COUNT of bookings
booking_start_counts = cluster_data.groupby(['START_LATITUDE', 'START_LONGITUDE', 'HOUR_OF_DAY']).size().reset_index(name="COUNT")
booking_end_counts = cluster_data.groupby(['END_LATITUDE', 'END_LONGITUDE', 'HOUR_OF_DAY']).size().reset_index(name="COUNT")
booking_start_counts.head()
# Define functions for later use
def map_hour_to_cyclic_time(hour):
"""
Maps hours to integer values in such a way that morning and afternoon have the same distance to the night.
This is required for k-means to deliver reasonable results because it relies on the Euclidean metric.
"""
if (hour >= 5) and (hour <= 13):
return 1
elif (hour >= 14) and (hour <= 21):
return 3
return 2
def plot_elbow_graph(df):
"""
The elbow method is used to visualize the cluster losses of k-means, obtained via model.inertia_, for an increasing number of clusters.
"""
clusters = []
losses = []
for k in range(10):
model = KMeans(n_clusters=k + 1, random_state=1337, init='k-means++')
model.fit(df)
clusters.append(k + 1)
losses.append(model.inertia_)
plt.plot(clusters, losses)
plt.show()
K-means is very sensitive to differences in feature scale, so HOUR_OF_DAY and COUNT must be scaled properly for the best results. Sklearn's StandardScaler is used here.
# Scale values and plot elbow graph
scaler = StandardScaler()
booking_start_counts_scaled = booking_start_counts.copy()
booking_start_counts_scaled['HOUR_OF_DAY'] = booking_start_counts_scaled['HOUR_OF_DAY'].apply(map_hour_to_cyclic_time)
booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']] = scaler.fit_transform(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
booking_end_counts_scaled = booking_end_counts.copy()
booking_end_counts_scaled['HOUR_OF_DAY'] = booking_end_counts_scaled['HOUR_OF_DAY'].apply(map_hour_to_cyclic_time)
booking_end_counts_scaled[['COUNT', 'HOUR_OF_DAY']] = scaler.fit_transform(booking_end_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
plot_elbow_graph(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
It is important to note that the k-means++ initialisation is used instead of the "normal" one to fit the model. The advantage of k-means++ is that it spreads the initial cluster centroids out across the data rather than picking them uniformly at random, which decreases the probability of ending up in a poor local optimum dramatically.
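As a toy sketch of this effect (synthetic blob data, not the booking dataset), one can compare a single random initialisation with a single k-means++ initialisation and inspect the resulting inertia:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated blobs
X_toy, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.5, random_state=0)

# A single purely random initialisation vs. a single k-means++ initialisation
random_init = KMeans(n_clusters=4, init='random', n_init=1, random_state=2).fit(X_toy)
plusplus_init = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=2).fit(X_toy)

# k-means++ spreads the initial centroids out and therefore tends to
# reach a lower (or equally low) inertia than a random initialisation
print('random init loss:   ', random_init.inertia_)
print('k-means++ init loss:', plusplus_init.inertia_)
```

Averaged over many seeds, the k-means++ losses are typically lower; for any single seed the two outcomes may coincide.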
# Fit a k-means++ model with the booking start counts
n_clusters = 4
model = KMeans(n_clusters=n_clusters, random_state=1337, init='k-means++')
model.fit(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
# Plot clustered start rental zone bookings
clustered_booking_start_counts = booking_start_counts.copy()
# Attach the cluster result to the unscaled dataframe for meaningful plotting
clustered_booking_start_counts['CLUSTER'] = model.predict(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
sns.scatterplot(x='HOUR_OF_DAY', y='COUNT', data=clustered_booking_start_counts, hue='CLUSTER').set_title('Clustered start rental zone bookings')
plt.show()
Cluster 0 appears both on the right and on the left of the plot because the hours have been mapped with the map_hour_to_cyclic_time function.
Additionally, it is important to note that the end rental zones are clustered with the start rental zone model to have a good comparison between the two results.
# Plot clustered end rental zone bookings (clustered with the start rental zone model for comparative power)
clustered_booking_end_counts = booking_end_counts.copy()
clustered_booking_end_counts['CLUSTER'] = model.predict(booking_end_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
sns.scatterplot(x='HOUR_OF_DAY', y='COUNT', data=clustered_booking_end_counts, hue='CLUSTER').set_title('Clustered end rental zone bookings')
plt.show()
# Afternoon
sns.scatterplot(x='START_LATITUDE', y='START_LONGITUDE', data=clustered_booking_start_counts[clustered_booking_start_counts['CLUSTER'] == 3], hue='COUNT', size='COUNT')
plt.show()
sns.scatterplot(x='END_LATITUDE', y='END_LONGITUDE', data=clustered_booking_end_counts[clustered_booking_end_counts['CLUSTER'] == 3], hue='COUNT', size='COUNT')
plt.show()
In the afternoon, many people ride from the city to the outskirts.
In the following, the customers will be clustered by their bookings. It is possible that there is a correlation between the booking duration and the distance between the start and end rental zone. Therefore, as a first clustering task, the bookings are plotted by booking duration and distance and then clustered.
First, a copy of the dataframe cluster_data with the columns 'BOOKING_DURATION' and 'DISTANCE' is created as X.
# Setting X as duration and distance for clustering
X = cluster_data[['BOOKING_DURATION', 'DISTANCE']].copy()
X.dropna(axis=0, inplace=True)
X.head()
Because k-means++ builds clusters based on the Euclidean distance, the calculation of clusters is very sensitive to outliers, which would distort the clusters. Therefore, outliers have to be removed. As a first step, the data is plotted to find out whether there are any outliers.
# Building boxplot for outlier detection
X.boxplot(['BOOKING_DURATION'],figsize = (10,20),grid=True);
The boxplot shows that there are many extreme outliers lying more than one and a half interquartile ranges away from the box. Therefore, it is necessary to remove them.
For the removal of the outliers, the z-score is used.
"The Z-score method relies on the mean and standard deviation of a group of data to measure central tendency and dispersion. This is troublesome, because the mean and standard deviation are highly affected by outliers – they are not robust. In fact, the skewing that outliers bring is one of the biggest reasons for finding and removing outliers from a dataset!
The Z-score, or standard score, is a way of describing a data point in terms of its relationship to the mean and standard deviation of a group of points. Taking a Z-score is simply mapping the data onto a distribution whose mean is defined as 0 and whose standard deviation is defined as 1.
The goal of taking Z-scores is to remove the effects of the location and scale of the data, allowing different datasets to be compared directly. The intuition behind the Z-score method of outlier detection is that, once we’ve centred and rescaled the data, anything that is too far from zero (the threshold is usually a Z-score of 3 or -3) should be considered an outlier." [http://colingorrie.github.io/outlier-detection.html#z-score-method]
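As a toy illustration of the |z| < 3 rule (hypothetical durations, not the booking data): a single extreme value inflates the mean and standard deviation, yet still ends up far enough from zero to be filtered out.

```python
import numpy as np
from scipy import stats

# Thirty plausible booking durations in minutes plus one extreme outlier
durations = np.append(np.tile([10., 12., 15., 11., 14., 13.], 5), 9000.)

# Z-score: distance from the mean in units of the standard deviation
z = np.abs(stats.zscore(durations))

# Keep only observations with |z| < 3; the outlier is dropped
kept = durations[z < 3]
print(len(durations), '->', len(kept))  # 31 -> 30
```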
# Calculating z-score
z = np.abs(stats.zscore(X['BOOKING_DURATION']))
# Cleaning data for clustering
X_Cleaned = X.loc[z<3,:]
X_Cleaned.describe()
As seen in the description of the data above, the maximum duration is now only 419 min instead of 9204 min.
Because distance and duration are measured in different units, the data has to be scaled. This makes it possible to calculate more natural clusters.
# Scale the data because k-means++ is sensitive to outliers
scaler = StandardScaler();
scaler.fit(X_Cleaned);
X_scaled = scaler.transform(X_Cleaned);
X_scaled_df = pd.DataFrame(X_scaled,columns=X_Cleaned.columns, index = X_Cleaned.index);
X_scaled_df.head()
For a first impression of the data, a scatterplot is created.
# Scatterplot
plt.figure(figsize=(15, 10), dpi=80)
fig = plt.scatter(data=X_scaled_df, x='BOOKING_DURATION',y='DISTANCE',s=6);
At first glance, duration and distance do not correlate as expected: a longer booking does not imply a longer distance.
To get a visual estimate of the best number of clusters, the elbow graph is plotted.
# Plotting elbow method
plot_elbow_graph(X_scaled_df)
The plot of the elbow method suggests that 5 clusters are a good fit.
# Fit a k-means++ model
five_means = KMeans(n_clusters = 5, random_state=1337, init='k-means++');
five_means.fit(X_scaled);
five_means.predict(X_scaled);
For an easier interpretation of the later plot, the previously scaled data is now scaled back.
# Scaling back for better understanding of plots
five_means_new_inverse = scaler.inverse_transform(X_scaled)
five_means_new_inverse_df = pd.DataFrame(five_means_new_inverse,columns=X_Cleaned.columns, index = X_Cleaned.index)
five_means_new_inverse_df.describe()
# Plotting the five clusters
flatui = ["#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"]
numbers = ["zero", "one", "two", "three", "four", "five"]
five_means_new_inverse_df["five"] = five_means.predict(X_scaled)
five_means_new_inverse_df["five"] = five_means_new_inverse_df["five"].apply(lambda x: numbers[x])
sns.pairplot(data=five_means_new_inverse_df,height=8,hue='five',palette=sns.color_palette(flatui));
For a better visualization, the plot in the lower left corner is plotted separately in the following cell.
# Scatterplot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=five_means_new_inverse_df, y='DISTANCE', x='BOOKING_DURATION', hue='five', palette=sns.color_palette(["#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"]),s=8);
The bookings were divided into 5 clusters:
The customers with a distance of zero could be round-trip riders. Trips in the dark blue cluster could be day trips where the bike is rented for a few hours. The red, yellow, and light blue trips are probably commuter trips.
It may be interesting to cluster the customers by their most frequent rental start time and their number of bookings. A possible result is that customers who rent a similar number of times also rent at similar times of day (e.g. commuters, pleasure bikers, etc.).
In order to do that clustering, the dataframe customer_bookings is created. It lists the hours of the start booking times of all bookings per customer ('RENTHOURS').
# Creating a list of hours of bookings for each customer
customer_bookings = cluster_data.groupby('CUSTOMER_HAL_ID')['HOUR_OF_DAY'].apply(list).reset_index(name='RENTHOURS')
customer_bookings.head()
To get the hour of day at which a customer has started the most bookings, a third column ('MODUS') is created that contains the mode of the list (the most frequently occurring hour).
# Adding MODUS column with the mode of the booking start hours per customer
customer_bookings['MODUS'] = customer_bookings['RENTHOURS'].apply(lambda x: stats.mode(np.array(x).astype(int))[0][0])
To get the number of bookings a customer has made over the two-year period, the length of the list is calculated and added as the value of the column 'COUNT'.
# Adding COUNT column with number of rentals per customer
customer_bookings['COUNT'] = customer_bookings['RENTHOURS'].apply(lambda x: len(x))
For a first impression of the data, a scatterplot is created.
# Scatterplot
plt.figure(figsize=(15, 10), dpi=80)
fig = plt.scatter(data=customer_bookings, x='MODUS',y='COUNT',s=6);
For an overview of how many customers favour certain rental times, a plot is created in which the customers are grouped by their most frequent start booking hour.
# Plotting the number of customers for their most rented time of day
sns.scatterplot(x=customer_bookings.groupby(['MODUS']).count().index, y='COUNT', data=customer_bookings.groupby(['MODUS']).count());
A daily seasonality can be observed.
It is necessary to map each hour onto a circle such that the lowest value of that variable appears right next to the largest value. This is done by computing the x- and y-components of the corresponding point using the sine and cosine trigonometric functions. This transformation will be applied to all time features in this analysis.
source: Link
# Adding sin and cos of hour for more natural calculations
customer_bookings['HOUR_SIN'] = np.sin(customer_bookings.MODUS * (2. * np.pi / 24))
customer_bookings['HOUR_COS'] = np.cos(customer_bookings.MODUS * (2. * np.pi / 24))
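A small sketch (toy values only) illustrates why this encoding matters for a Euclidean method like k-means: on the raw hour scale, 23:00 and 01:00 are 22 units apart, while on the circle they become close neighbours.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of the day onto the unit circle (sin/cos components)."""
    angle = hour * (2. * np.pi / 24)
    return np.array([np.sin(angle), np.cos(angle)])

# Raw scale: 23:00 and 01:00 look maximally far apart
raw_distance = abs(23 - 1)

# Cyclic encoding: 23:00 and 01:00 are close, 23:00 and 11:00 are opposite
cyclic_distance = np.linalg.norm(encode_hour(23) - encode_hour(1))
opposite_distance = np.linalg.norm(encode_hour(23) - encode_hour(11))

print(raw_distance, round(cyclic_distance, 3), round(opposite_distance, 3))  # 22 0.518 2.0
```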
Because the units of time and count are different, the data has to be scaled. This makes it possible to calculate more natural clusters.
# Scale data for clustering
X2 = customer_bookings[['HOUR_SIN', 'HOUR_COS', 'COUNT']]
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
X2_scaled_df = pd.DataFrame(X2_scaled,columns=X2.columns, index = X2.index)
X2_scaled_df.head()
To get a visual estimate of the best number of clusters, the elbow graph is plotted.
# Plotting elbow method
plot_elbow_graph(X2_scaled_df)
The plot of the elbow method suggests that 4 clusters are a good fit.
# Fit a k-means++ model with the customer booking data
four_means = KMeans(n_clusters = 4, random_state=1337, init='k-means++');
four_means.fit(X2_scaled);
four_means.predict(X2_scaled);
For an easier interpretation of the later plot, the previously scaled data is now scaled back.
# Scaling back for better understanding of plots
X2_new_inverse = scaler.inverse_transform(X2_scaled);
X2_new_inverse_df = pd.DataFrame(X2_new_inverse,columns=X2.columns, index = X2.index);
X2_new_inverse_df['MODUS'] = customer_bookings['MODUS'];
X2_new_inverse_df.describe()
# Plotting clusters
# Use colours that do not discriminate against colour blindness.
flatui = ["#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"]
numbers = ["zero", "one", "two", "three", "four"]
X2_new_inverse_df["four"] = four_means.predict(X2_scaled);
X2_new_inverse_df["four"] = X2_new_inverse_df["four"].apply(lambda x: numbers[x]);
sns.pairplot(data=X2_new_inverse_df,height=8,hue="four",palette=sns.color_palette(flatui));
For a better visualization, the plot in the second to last row on the right is plotted separately in the following cell.
# Scatterplot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=X2_new_inverse_df, y='COUNT', x='MODUS', hue='four', palette=sns.color_palette(["#d7191c", "#fdae61", "#ffffbf", "#abd9e9"]),s=50);
The customer clusters can be divided into two types:
There are three types of customers who rent between 1 and ca. 100 times:
Then there are customers who rent between 100 and 800 times:
In this section the trajectory of the demand series is plotted at different resolutions. It can be seen that rental demand is especially high in the middle of the year and low at its start and end. In addition, the graphs show that demand is highly volatile, with increasing variance in the middle of the year.
# Creating fig and subplots
fig, axs = plt.subplots(4, 1, figsize=(17, 15), sharex=True)
# One hour resolution plot
one_h_data['COUNT'].plot(ax=axs[0])
axs[0].grid(True)
axs[0].set_title('Hourly')
# Two hour resolution plot
two_h_data['COUNT'].plot(ax=axs[1])
axs[1].grid(True)
axs[1].set_title('Two Hour Resolution')
# Six hour resolution plot
six_h_data['COUNT'].plot(ax=axs[2])
axs[2].grid(True)
axs[2].set_title('Six Hour Resolution')
# 24 hour resolution plot
twentyfour_h_data['COUNT'].plot(ax=axs[3])
axs[3].grid(True)
axs[3].set_title('Daily')
axs[3].set_xlabel('Date')
fig.text(0.5, 0.92, 'Bike Rental Demand', ha='center', size=30);
In order to detail the above insights, box plots will be used in the following to inspect the mean shifts that occur on a (sub-)daily, weekly and monthly basis. Since the above showed seasonal differences depending on the month of the year, the data will be enriched with the season of the year each observation falls in, in order to isolate the impact of a specific season on (sub-)daily rentals.
Conclusions that can be drawn from the below graphs are as follows:
# Creating individual HOUR, DAY_OF_WEEK, DAY_OF_MONTH, WEEK_OF_YEAR and MONTH_OF_YEAR columns
one_h_data['HOUR'] = one_h_data.index.strftime('%H')
one_h_data['DAY_OF_WEEK'] = one_h_data.index.strftime('%a')
one_h_data['DAY_OF_MONTH'] = one_h_data.index.strftime('%d')
one_h_data['WEEK_OF_YEAR'] = one_h_data.index.strftime('%W')
one_h_data['MONTH_OF_YEAR'] = one_h_data.index.strftime('%m')
# Creating a specific cat type to allow logical ordering of DAY_OF_WEEK
cat_type = pd.api.types.CategoricalDtype(categories=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True)
one_h_data['DAY_OF_WEEK'] = one_h_data['DAY_OF_WEEK'].astype(cat_type)
# Creating a SEASON column based on the specific month
seasons = []
for month in one_h_data.index.strftime('%m').astype('int'):
if month in [1, 2, 12]:
seasons.append('WINTER')
elif month in [3, 4, 5]:
seasons.append('SPRING')
elif month in [6, 7, 8, 9]:
seasons.append('SUMMER')
elif month in [10, 11]:
seasons.append('FALL')
one_h_data['SEASON'] = seasons
# Creating fig and subplots
fig = plt.figure(figsize=(17, 30))
ax1 = fig.add_subplot(511)
ax2 = fig.add_subplot(512)
ax3 = fig.add_subplot(513)
ax4 = fig.add_subplot(514)
ax5 = fig.add_subplot(515)
# Plotting hour of the day boxplots
hue_order = ['SPRING', 'SUMMER', 'FALL', 'WINTER']
sns.boxplot(x='HOUR', y='COUNT', hue='SEASON', hue_order=hue_order, data=one_h_data, ax=ax1)
ax1.set_title('Hour of the Day')
ax1.set_xlabel('')
ax1.grid(True)
# Plotting day of the week boxplots
sns.boxplot(x='DAY_OF_WEEK', y='COUNT', hue='SEASON', hue_order=hue_order, data=one_h_data, ax=ax2).legend_.remove()
ax2.set_title('Day of the Week')
ax2.set_xlabel('')
ax2.grid(True)
# Plotting day of the month boxplots
sns.boxplot(x='DAY_OF_MONTH', y='COUNT', hue='SEASON', hue_order=hue_order, data=one_h_data, ax=ax3).legend_.remove()
ax3.set_title('Day of the Month')
ax3.set_xlabel('')
ax3.grid(True)
# Plotting week of the year boxplots
sns.boxplot(x='WEEK_OF_YEAR', y='COUNT', data=one_h_data, ax=ax4)
ax4.set_title('Week of the Year')
ax4.set_xlabel('')
ax4.grid(True)
# Plotting month of the year boxplots
sns.boxplot(x='MONTH_OF_YEAR', y='COUNT', data=one_h_data, ax=ax5)
ax5.set_title('Month of the Year')
ax5.set_xlabel('')
ax5.grid(True)
fig.text(0.5, 0.92, 'Seasonalities in Bike Rental Demand', ha='center', size=30);
In the following chapter, predictive analysis will be conducted for each resolution. First, a feature examination and feature engineering will be carried out. Second, appropriate algorithms will be trained and their performance evaluated using rolling cross-validation.
The prediction of daily bike rental demand is very important. It can help to increase service quality, since the redistribution of bikes can be planned roughly for the following day. It enables the company to allocate the necessary amount of resources (e.g. service workers), which is difficult on a sub-daily basis. It is therefore a central part of mid-term planning.
In addition to the descriptive analysis, which showed that the trajectory itself already contains information regarding its realization at a specific time or time frame, other useful features have been added during data preparation. The following graphs show the correlation between variables, represented by a fitted linear regression line, on the upper diagonal; the frequency on the main diagonal; and the conditional distribution of each variable with respect to the other on the lower diagonal.
As one might expect, the graphs show that sunshine duration as well as temperature are positively correlated with the number of bookings, which seems reasonable since people should be more likely to ride a bike when it is dry and warm outside. Not surprisingly, humidity, precipitation height and wind speed exert a negative impact on the number of bookings. Precipitation height, however, does this with a lot of uncertainty, which can be explained by the fact that zero is the most frequent precipitation height, leaving relatively few observations from which to gauge the impact on bike rental demand.
All in all, it can be seen that every numerical feature exerts an influence on the target variable. Therefore, all of these features will be included during prediction and the above assumptions will be re-evaluated.
# Creating a subset of only numerical features
numerical_features = twentyfour_h_data[['COUNT', 'SUNSHINE_DURATION', 'AIR_TEMP', 'HUMIDITY',
'PRECIPITATION_HEIGHT', 'MEAN_WIND_SPEED']]
# Initializing and plotting a pair grid with regression plots on the upper, hist plots on the main and kde plots
# on the lower diagonal
g = sns.PairGrid(numerical_features)
g = g.map_upper(sns.regplot, scatter_kws={'alpha':0.7}, line_kws={'color': 'red'})
g = g.map_diag(plt.hist)
g = g.map_lower(sns.kdeplot, legend=False);
The box plots below visualize the difference in mean within the range of the binary features. It can be seen that the booking counts differ across the values of all three binary features. Therefore, all three variables will be used in prediction.
# Creating fig and subplots
fig, ax = plt.subplots(1, 3, figsize=(20, 15), sharey=True)
# Plotting rain fallen boxplots
sns.boxplot(x='RAIN_FALLEN', y='COUNT', data=twentyfour_h_data, ax=ax[0])
ax[0].set_title('Rain Fallen')
ax[0].set_xlabel('')
# Plotting bank holiday boxplots
sns.boxplot(x='BANK_HOLIDAY', y='COUNT', data=twentyfour_h_data, ax=ax[1])
ax[1].set_title('Bank Holiday')
ax[1].set_xlabel('')
ax[1].set_ylabel('')
# Plotting school holiday boxplots
sns.boxplot(x='SCHOOL_HOLIDAY', y='COUNT', data=twentyfour_h_data, ax=ax[2])
ax[2].set_title('School Holiday')
ax[2].set_xlabel('')
ax[2].set_ylabel('');
The descriptive analysis has shown that the booking counts varied over time in a seasonal manner. Since the plots showed an almost cyclical pattern, the following plots show the correlation between the actual values and their lags. It becomes clear that there is autocorrelation in this data, which allows the use of traditional time series models for prediction.
# Creating fig and subplots
fig, axes = plt.subplots(2, 5, figsize=(20, 7), sharex=True, sharey=True, dpi=500)
# Plotting the relationship between the actual count and its lagged values
for i, ax in enumerate(axes.flatten()[:10]):
pd.plotting.lag_plot(twentyfour_h_data['COUNT'], lag=i + 1, ax=ax)
ax.set_title('Lag ' + str(i + 1))
plt.tight_layout();
The underlying daily data contains some seasonal and time-dependent factors as well as some correlation with the chosen exogenous features. Since this section of the notebook deals with data at a daily resolution, classical time-series algorithms can be used, as daily resolution is the smallest acceptable granularity for most of them.
Therefore, in the following, two algorithms (SARIMAX and PROPHET) will be fitted and evaluated with regard to their fitting abilities (explanatory power) and their predictive performance based on a rolling one-step-ahead forecast.
The SARIMAX model is appropriate for this prediction as the data contains seasonal characteristics, autocorrelation and correlation between exogenous regressors. Less complex or more efficient models like ARIMA or ARMA do not allow for the use of exogenous regressors and do not account for seasonality, which makes them inappropriate for this matter.
To fit a good model one must find suitable parameters. The plots below show the autocorrelation as well as the partial autocorrelation function of the differenced time series.
The ACF shows a first peak crossing the lower confidence interval at lag one, indicating a non-seasonal MA parameter of order one. It also shows recurring peaks at every eighth lag, indicating a seasonal MA parameter of order one with a seasonal period of eight.
The PACF shows a first peak crossing the lower confidence interval at lag one, indicating a non-seasonal AR parameter of order one. It shows, however, no seasonal pattern, indicating a seasonal AR parameter of order zero.
# Creating fig and subplots
fig = plt.figure(figsize=(20, 10))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
# Plotting the autocorrelation of the actual count
fig = plot_acf(twentyfour_h_data['COUNT'].diff().dropna(axis=0),lags=50, ax=ax1)
# Plotting the partial-autocorrelation of the actual count
fig = plot_pacf(twentyfour_h_data['COUNT'].diff().dropna(axis=0),lags=50, ax=ax2)
# Initialising the order and config of the SARIMAX model
sarimax_model = SARIMAX(endog=twentyfour_h_data['COUNT'], exog=twentyfour_h_data.drop(columns=['COUNT'], axis=1),
order=(1, 1, 1), seasonal_order=(0, 0, 1, 8), freq='D')
%%time
# Fitting the sarimax model
sarimax_result = sarimax_model.fit(maxiter=1000, disp=False, method='lbfgs')
The summary below shows that the coefficients of almost all exogenous variables are significant at the one percent level; only one variable is insignificant at all common significance levels. In addition, the AR and MA coefficients are significant, indicating a reasonable choice of seasonal and non-seasonal orders.
Additionally, the coefficient signs seem reasonable and are in line with the findings of the feature examination, for example the strong positive influence of air temperature and the negative influence of school holidays.
# Printing the summary of the fitted model
sarimax_result.summary()
# Plotting diagnostical residual plots
fig = sarimax_result.plot_diagnostics(figsize=(17,10));
The above plots indicate that the model has been specified reasonably well, since the resulting residuals are approximately normally distributed, which is one of the key assumptions of SARIMA models. In addition, most of the autocorrelation has been removed by the model. However, some correlation remains, indicating that the current model does not perfectly depict the true process. To find a better model, a grid search over the hyper-parameters should be conducted, since determining the true orders becomes difficult when seasonal as well as non-seasonal autoregressive and moving-average characteristics are present.
# Plotting the fitted model
pd.DataFrame({'ACTUALS': twentyfour_h_data['COUNT'], 'PREDICTIONS': sarimax_result.fittedvalues}).plot(figsize=(20, 10));
The above graph shows the fitted values of the model, which are the result of in-sample predictions. It can be seen that the model fits the data relatively well, though it seems to have problems predicting comparatively high or low bike rental demand. This is not surprising, as the trajectory is highly volatile, which may be difficult for a relatively simple model representation such as SARIMAX. It does not mean, however, that the model has low predictive performance; to evaluate that, performance on unseen data has to be examined. Since there are other models that can deal with multiple seasonalities and high volatility, SARIMAX will serve as a baseline for the prediction of daily data in the following analysis and will be benchmarked after a competing model has been trained.
Facebook developed a time series algorithm that fits a model decomposing the series into three main components: trend, seasonality and holidays. In addition, it allows external regressors to be included.
The main benefits of the PROPHET model are as follows:
For further information please consult https://peerj.com/preprints/3190.pdf
Therefore, in the following, a PROPHET model will be fitted with hyperparameters adjusted to account for the high volatility of the data.
# Creating a copy of twentyfour_h_data so that prophet can interpret input
prophet_df = twentyfour_h_data.copy()
prophet_df = prophet_df.reset_index().rename(columns={'DATE_FROM': 'ds', 'COUNT': 'y'})
# Creating a holiday dataframe so prophet can interpret them correctly
BANK_HOLIDAYS = pd.DataFrame({
'holiday': 'BANK_HOLIDAYS',
'ds': pd.to_datetime(prophet_df.loc[prophet_df.BANK_HOLIDAY == 1, 'ds']),
'lower_window': 0,
'upper_window': 1,
})
SCHOOL_HOLIDAYS = pd.DataFrame({
'holiday': 'SCHOOL_HOLIDAYS',
'ds': pd.to_datetime(prophet_df.loc[prophet_df.SCHOOL_HOLIDAY == 1, 'ds']),
'lower_window': 0,
'upper_window': 1,
})
holidays = pd.concat((BANK_HOLIDAYS, SCHOOL_HOLIDAYS))
# Dropping columns BANK_HOLIDAY and SCHOOL_HOLIDAY
prophet_df.drop(columns=['BANK_HOLIDAY', 'SCHOOL_HOLIDAY'], axis=1, inplace=True)
# Initialising the prophet model
prophet_model = Prophet(holidays=holidays, seasonality_mode='additive', changepoint_prior_scale=0.5,
changepoint_range=0.9, uncertainty_samples=100)
# Adding exogenous regressors to the initialised model
prophet_model.add_regressor('SUNSHINE_DURATION')
prophet_model.add_regressor('AIR_TEMP')
prophet_model.add_regressor('PRECIPITATION_HEIGHT')
prophet_model.add_regressor('MEAN_WIND_SPEED')
prophet_model.add_regressor('HUMIDITY')
prophet_model.add_regressor('RAIN_FALLEN');
%%time
# Fitting the model
prophet_model.fit(prophet_df);
Looking at the timing output, it can already be seen that PROPHET fits very fast: it is about 30x faster than the SARIMAX model in terms of training time.
# Forecasting the fitted data for seasonal and residual plots
forecast = prophet_model.predict(prophet_df);
# Plotting the model components
fig = prophet_model.plot_components(forecast, weekly_start=1);
The above graph shows trend, seasonal patterns as well as the influence of the exogenous regressors and holidays.
Trend
It seems like the model has not detected the correct trend in the data since the curve looks more like a seasonal pattern.
Holidays
Prophet has also recognized the negative impact of school and bank holidays; it attributes an effect of sometimes 150 fewer bikes during a holiday period.
Weekly Seasonality
The algorithm has detected the same cyclical nature as found in the descriptive analysis. Bike rental demand increases from the start of the week until Thursday and then decreases again towards the end of the week.
Yearly Seasonality
Overall, the algorithm detects the yearly seasonal pattern in a reasonable way. On second thought, however, some of the illustrated patterns seem suspicious: namely the dip in April, a month that usually shows an increase due to better weather conditions, and the increase in November, which cannot be observed in the raw time series. This can either be related to the imperfectly defined trend, or hold true after the effects of the other variables have been filtered out.
External Regressors
This graph is difficult to interpret since the impacts of all regressors are accumulated, so the influence of each individual regressor is unclear. However, the combined influence appears to change almost seasonally, just as climate data is expected to.
# Creating fig and subplots
fig, axs = plt.subplots(2, 1, figsize=(17, 10), sharex=True)
# Plotting the fitted model
pd.DataFrame({'Actuals': prophet_df['y'], 'Predictions': forecast['yhat']}).plot(ax=axs[0])
# Plotting the residuals
pd.DataFrame({'Residuals': prophet_df['y'] - forecast['yhat']}).plot(ax=axs[1]);
The above graphs show the fit of the model as well as the residuals. It seems that PROPHET did not perform as well as the SARIMAX model in terms of fit to the data. However, since the aim is good predictive accuracy on unseen data, the models have to be cross-validated.
In the following, a rolling one-step-ahead cross-validation will be conducted. For this purpose, the models are trained on data up to a specific point in time t and forecast t + 1. This step is repeated 41 times, moving the forecast origin nine periods into the future each time until no data is left.
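The number of forecast origins can be verified with a short sketch (dates as used in the loop below: origins start on 2016-01-05 and advance in nine-day steps):

```python
import pandas as pd

# Forecast origins: every ninth day from 2016-01-05 until the end of 2016
origins = pd.date_range('2016-01-05', '2016-12-31', freq='9D')

print(len(origins), origins[0].date(), origins[-1].date())  # 41 2016-01-05 2016-12-30
```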
# Setting the period parameter
period = 9
# Setting the horizon parameter
horizon = 1
%%time
# Creating a list to store the prediction results
sarimax_predictions = []
# Looping through the dataset creating a one-step-ahead prediction
for date in twentyfour_h_data.loc['2016-01-05'::period, :].index:
# Setting the date to forecast
fc_date = date + relativedelta(days=horizon)
# Creating test and train splits
sarimax_train_endog = twentyfour_h_data.loc['2015-01-01':date, 'COUNT']
sarimax_train_exog = twentyfour_h_data.loc['2015-01-01':date, 'SUNSHINE_DURATION':]
sarimax_test_exog = twentyfour_h_data.loc[fc_date:fc_date, 'SUNSHINE_DURATION':]
# Initialising the order and parameters
sarimax_model = SARIMAX(endog=sarimax_train_endog, exog=sarimax_train_exog, order=(1, 0, 1),
seasonal_order=(0, 0, 1, 8), freq='D', simple_differencing=True)
# Fitting the model to training data
sarimax_osh_model = sarimax_model.fit(maxiter=1000, disp=False, method='lbfgs')
# One-step-ahead prediction
predicted_values = sarimax_osh_model.predict(start=fc_date, exog=sarimax_test_exog)
# Appending the prediction to the list
sarimax_predictions.append(predicted_values.values[0])
%%time
# Starting a one-step-ahead prediction
prophet_cv = cross_validation(prophet_model,
initial='365.25 days',
period='{} days'.format(period),
horizon='{} days'.format(horizon))
# Creating a dataframe containing the actuals and predictions of the sarimax and prophet models
daily_evaluation = pd.DataFrame({'Actuals': prophet_cv['y'],
'Prophet': prophet_cv['yhat'],
'SARIMAX': sarimax_predictions}).set_index(prophet_cv['ds'])
# Calculating the residuals of both models
sarimax_residuals = prophet_cv['y'].values - sarimax_predictions
prophet_residuals = prophet_cv['y'].values - prophet_cv['yhat'].values
# Creating a dataframe containing the residuals of both models
daily_residuals = pd.DataFrame({'Prophet': prophet_residuals,
'Sarimax': sarimax_residuals}).set_index(prophet_cv['ds'])
# Creating fig and subplots
fig, axs = plt.subplots(2, 1, figsize=(17, 10), sharex=True)
# Plotting the actuals and predictions
daily_evaluation.plot(ax=axs[0])
# Plotting the residuals
daily_residuals.plot(ax=axs[1]);
# Printing the MAE of both models
print('Mean Absolute Error of Prophet: {0:.2f}\n'.format(performance_metrics(prophet_cv).iloc[0, 3]))
print('Mean Absolute Error of SARIMAX: {0:.2f}'.format(abs(sarimax_residuals).mean()))
# Printing the nMAE of both models
print('Normalised Mean Absolute Error of Prophet: {0:.2f}\n'.format(performance_metrics(prophet_cv).iloc[0, 3] / twentyfour_h_data['COUNT'].mean()))
print('Normalised Mean Absolute Error of SARIMAX: {0:.2f}'.format(abs(sarimax_residuals).mean() / twentyfour_h_data['COUNT'].mean()))
Looking at the above it becomes clear that Prophet is the better model for the underlying data. Apart from fitting considerably faster, the Prophet algorithm is also better at predicting bike rental demand, as the comparison of the mean absolute errors shows: on average, Prophet is eight bikes closer to the actual demand than SARIMAX. Even though this is only a small predictive improvement, Prophet is around 40 times more efficient in terms of computational time.
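To make the two metrics used above concrete, here is a small self-contained sketch with hypothetical numbers (not the notebook's results) showing how the MAE and its normalised counterpart (nMAE, i.e. MAE divided by the mean demand) are computed:

```python
import numpy as np

# Hypothetical daily actuals and two competing forecasts (illustrative only)
actuals = np.array([40.0, 55.0, 60.0, 35.0, 50.0])
model_a = np.array([42.0, 50.0, 58.0, 38.0, 47.0])
model_b = np.array([30.0, 66.0, 49.0, 46.0, 41.0])

def mae(forecast):
    """Mean absolute error in units of the target (here: bikes)."""
    return np.mean(np.abs(actuals - forecast))

# Raw MAE tells us how many bikes a forecast is off on average ...
print(mae(model_a), mae(model_b))
# ... while dividing by the mean demand makes it comparable across resolutions
print(mae(model_a) / actuals.mean())
```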
The following section of this notebook deals with the prediction of sub-daily data. Even though PROPHET can be used for sub-daily data, it performed poorly and is therefore not considered for the prediction in this section.
The following steps will be carried out for each of the re-sampled resolutions.
Using hourly data, a more detailed comparison will be conducted between a baseline model (linear regression), a random forest regression with default hyperparameters and a random forest regression with optimized hyperparameters. In doing so, the suitability of the models will be assessed and a randomized grid search for hyperparameters will be carried out. These hyperparameters will also be used for the other resolutions. Since the feature examination process is equal for all resolutions, only the process for hourly data will be illustrated.
Similar to the daily data, the features in the hourly data show reasonable correlations. The main differences are that the correlation between counts and the sunshine duration seems to be less strong and that the wind speed is positively correlated with bike rental demand. Also, the shape of the correlation between air temperature and bike rental demand does not seem to be linear.
Since all of the variables seem to have an effect they will be included as predictors.
# Creating a subset of only numerical features
numerical_features = one_h_data[['COUNT', 'SUNSHINE_DURATION', 'AIR_TEMP', 'HUMIDITY',
'PRECIPITATION_HEIGHT', 'MEAN_WIND_SPEED']]
# Initializing and plotting a pair grid with regression plots in the upper triangle, histograms on the
# diagonal and scatter plots in the lower triangle
g = sns.PairGrid(numerical_features)
g = g.map_upper(sns.regplot, scatter_kws={'alpha':0.7}, line_kws={'color': 'red'})
g = g.map_diag(plt.hist)
g = g.map_lower(plt.scatter);
From the below boxplots it can be seen that the means do not differ with respect to the binary variable school holiday. This feature will therefore not be used during the analysis.
Bank holiday and rain fallen show a difference in the mean and will be included.
# Creating fig and subplots
fig, ax = plt.subplots(1, 3, figsize=(20, 15), sharey=True)
# Plotting rain fallen boxplots
sns.boxplot(x='RAIN_FALLEN', y='COUNT', data=one_h_data, ax=ax[0])
ax[0].set_title('Rain Fallen')
ax[0].set_xlabel('')
# Plotting bank holiday boxplots
sns.boxplot(x='BANK_HOLIDAY', y='COUNT', data=one_h_data, ax=ax[1])
ax[1].set_title('Bank Holiday')
ax[1].set_xlabel('')
ax[1].set_ylabel('')
# Plotting school holiday boxplots
sns.boxplot(x='SCHOOL_HOLIDAY', y='COUNT', data=one_h_data, ax=ax[2])
ax[2].set_title('School Holiday')
ax[2].set_xlabel('')
ax[2].set_ylabel('');
Looking at the autocorrelation it can be seen that there is significant autocorrelation at the first lag and that the correlation largely vanishes after five lags. Therefore, five lags will be included as features.
# Creating fig and subplots
fig, axes = plt.subplots(2, 5, figsize=(20, 7), sharex=True, sharey=True, dpi=500)
# Plotting the relationship between the actual count and its lagged values
for i, ax in enumerate(axes.flatten()[:10]):
pd.plotting.lag_plot(one_h_data['COUNT'], lag=i + 1, ax=ax)
ax.set_title('Lag ' + str(i + 1))
plt.tight_layout();
# Dropping columns that do not seem to have an effect
one_h_data.drop(columns=['SCHOOL_HOLIDAY', 'BANK_HOLIDAY', 'HOUR', 'DAY_OF_WEEK', 'DAY_OF_MONTH',
'WEEK_OF_YEAR', 'MONTH_OF_YEAR', 'SEASON'], axis=1, inplace=True)
# Generating columns that contain lagged values of the target variable
one_h_data['lag_1'] = one_h_data['COUNT'].shift(periods=1)
one_h_data['lag_2'] = one_h_data['COUNT'].shift(periods=2)
one_h_data['lag_3'] = one_h_data['COUNT'].shift(periods=3)
one_h_data['lag_4'] = one_h_data['COUNT'].shift(periods=4)
one_h_data['lag_5'] = one_h_data['COUNT'].shift(periods=5)
# Dropping missing values as a result of the shifts and reset the index
one_h_data = one_h_data.dropna(axis=0).reset_index()
In the descriptive analysis it was already shown that the means vary by season; therefore the seasons will be included as one-hot-encoded binary variables.
# Creating a SEASON list based on the specific month
seasons = []
for month in one_h_data.DATE_FROM.dt.strftime('%m').astype('int'):
if month in [1, 2, 12]:
seasons.append('WINTER')
elif month in [3, 4]:
seasons.append('SPRING')
elif month in [5, 6, 7, 8, 9]:
seasons.append('SUMMER')
elif month in [10, 11]:
seasons.append('FALL')
# Merging the seasons as one-hot-encoded binary variables
one_h_data = one_h_data.merge(pd.get_dummies(seasons, drop_first=False), how='left', left_index=True, right_index=True)
Descriptive analysis has shown that the mean demand varies by hour of the day, therefore the hour of the day seems to be a suitable feature for prediction. Since times have a cyclical nature (i.e. 11 pm is as close to 0 am as 1 am is), the hour of the day is transformed using a sine and cosine transformation. This maps each hour onto a circle so that the lowest value of the variable appears right next to the largest value, by computing the x- and y-components of that point with the sin and cos trigonometric functions. This transformation will be applied to all time features in this analysis.
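As a quick standalone illustration of why this encoding preserves closeness on the clock (independent of the notebook's data): in the sine/cosine plane, 23:00 ends up next to 01:00 but far away from 11:00.

```python
import numpy as np

def cyc(hour, period=24):
    """Map an hour of the day onto the unit circle via sine and cosine."""
    angle = hour * (2.0 * np.pi / period)
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distances in the transformed space
d_close = np.linalg.norm(cyc(23) - cyc(1))   # two hours apart on the clock
d_far = np.linalg.norm(cyc(23) - cyc(11))    # twelve hours apart
print(round(d_close, 3), round(d_far, 3))    # 0.518 2.0
```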
# Creating an individual column for hour of the day
one_h_data['HOUR'] = one_h_data.DATE_FROM.dt.strftime('%-H').astype('int')
# Applying sine,cosine transformation on column hour to retain the cyclical nature
one_h_data['HOUR_SIN'] = np.sin(one_h_data.HOUR * (2. * np.pi / 24))
one_h_data['HOUR_COS'] = np.cos(one_h_data.HOUR * (2. * np.pi / 24))
# Creating an individual column for week of the year
one_h_data['WEEK_OF_YEAR'] = one_h_data.DATE_FROM.dt.strftime('%W').astype('int')
# Applying sine,cosine transformation on column WEEK_OF_YEAR to retain the cyclical nature
one_h_data['WEEK_OF_YEAR_SIN'] = np.sin(one_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
one_h_data['WEEK_OF_YEAR_COS'] = np.cos(one_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
# Creating an individual column for day of the week
one_h_data['DAY_OF_WEEK'] = one_h_data.DATE_FROM.dt.strftime('%w').astype('int')
# Applying sine,cosine transformation on column DAY_OF_WEEK to retain the cyclical nature
one_h_data['DAY_OF_WEEK_SIN'] = np.sin(one_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
one_h_data['DAY_OF_WEEK_COS'] = np.cos(one_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
# Dropping individual time columns since their transformation will be used
one_h_data.drop(columns=['WEEK_OF_YEAR', 'DAY_OF_WEEK', 'HOUR'], axis=1, inplace=True)
The below code defines a function that is used for cross validation. The random cross validation is used to optimize the hyperparameters of the random forest regression. The functionality is documented in the function's docstring.
def TimeSeriesCV(data, params=None, grid_search=False, no_of_forecasts=104, horizon=1, seed=1, func=None,
date_col='DATE_FROM', dep_col='COUNT', cv_method='random', initial=None, period=None):
"""
Function that runs either a rolling or random cross validation.
Rolling cross validation fits the model on the specified initial date range and predicts the horizon,
then refits on the initial range plus the specified period and predicts the horizon again. This is
repeated until there is no new data left to refit on.
Random cross validation divides the data set in half and selects unique time points from the second
half as cutoffs. The model is then fit on the data points up to each cutoff and predicts the horizon.
This procedure repeats until no cutoff is left. It therefore proceeds similarly to a rolling forecast,
however choosing the cutoffs randomly.
If random cross-validation is chosen grid search for optimal hyperparameters with sklearns' random forest
regression can be run.
:param data: a dataframe containing all features and data
:param params: dict containing hyperparameters for sklearns' random forest regression
:param grid_search: boolean indicating whether the function is used for grid search
:param no_of_forecasts: int defining how many cutoffs should be chosen, i.e. the number of models to fit
:param horizon: int defining the amount of periods to forecast ahead
:param seed: int defining the random state for cutoff selection
:param func: a model object with hyperparameters set; will be ignored when grid_search is set to true
:param date_col: string specifying the column containing the date
:param dep_col: string specifying the column containing the target value
:param cv_method: string specifying the cross validation method either 'rolling' or 'random'
:param initial: int specifying the number of initial data points used for training in cv method 'rolling'
:param period: int specifying the number of additional data points to include in each subsequent forecast step of the rolling cv
:return: if grid_search set to true: dict containing the hyperparameters and the corresponding mae
:return: if grid_search set to false: a dataframe containing predictions, actuals and corresponding dates
and a list object containing the mae of predictions for each horizon
"""
# Saving the actual time
start_time = time.time()
# Checking for method
if cv_method == 'random':
# If seed is specified then set the seed
if seed:
np.random.seed(seed)
# Selecting random cutoff points
cutoffs = np.random.permutation(np.arange(len(data) / 2, len(data)))[:no_of_forecasts]
# Initialising an array of zeros with length horizon
residuals = np.zeros(horizon)
# Initialising a dataframe to append the results to
result_frame = pd.DataFrame(
{'CUTOFF': [], 'PREDICTED_PERIOD': [], 'STEP_AHEAD': [], 'ACTUAL': [], 'PREDICTION': []})
# Looping through the cutoffs predicting the cutoff + horizon
for cut in cutoffs:
cut = int(cut)
# Train test split
train_data = data[:cut]
test_data = data[len(train_data):len(train_data) + horizon]
train_data_y = train_data[dep_col]
test_data_y = test_data[dep_col]
train_data_X = train_data.drop(columns=[dep_col, date_col], axis=1)
test_data_X = test_data.drop(columns=[dep_col, date_col], axis=1)
# If not grid search use the specified function otherwise run random forest regression
# with specified hyperparameters
if not grid_search:
# Fit the model on training data
reg = func.fit(train_data_X, train_data_y)
# Predict the test data
predictions = reg.predict(test_data_X)
else:
# Initialising the random forest regression model object
func = RandomForestRegressor(n_estimators=params['n_estimators'],
max_features=params['max_features'],
max_depth=params['max_depth'],
min_samples_split=params['min_samples_split'],
min_samples_leaf=params['min_samples_leaf'],
bootstrap=params['bootstrap'],
random_state=123, n_jobs=-1)
# Fit the model on training data
reg = func.fit(train_data_X, train_data_y)
# Predict the test data
predictions = reg.predict(test_data_X)
# Looping through the predictions, adding the absolute error with respect to the horizon
# and appending the actual predictions to the dataframe
for counter, value in enumerate(predictions):
residuals[counter] += abs(value - test_data_y.iloc[counter])
result_frame = pd.concat([result_frame,
pd.DataFrame({'CUTOFF': [train_data[date_col].iloc[-1]],
'PREDICTED_PERIOD': [test_data[date_col].iloc[counter]],
'STEP_AHEAD': [counter + 1],
'ACTUAL': [test_data_y.iloc[counter]],
'PREDICTION': [value]})])
# Calculating the mean absolute error with respect to the horizon by dividing the summed
# absolute error of each horizon by the number of forecasts
residuals = residuals / no_of_forecasts
# If not grid search return dataframe and residuals otherwise return a dict
if not grid_search:
print('Finished RandomCV in {} seconds'.format(round(time.time() - start_time)))
return result_frame, residuals
else:
print('Finished Regressor with: {} in {} seconds and an accuracy of {}'.format(params, round(
time.time() - start_time), residuals.mean()))
return {residuals.mean(): params}
elif cv_method == 'rolling':
# Initialising an array of zeros with length horizon
residuals = np.zeros(horizon)
# Initialising a dataframe to append the results to
result_frame = pd.DataFrame(
{'CUTOFF': [], 'PREDICTED_PERIOD': [], 'STEP_AHEAD': [], 'ACTUAL': [], 'PREDICTION': []})
# Looping through the data starting with the initial specified and creating forecasts of length
# horizon after each specified period
for i in data.iloc[initial::period, :].index:
cut = int(i)
# Train test split
train_data = data[:cut]
test_data = data[len(train_data):len(train_data) + horizon]
train_data_y = train_data[dep_col]
test_data_y = test_data[dep_col]
train_data_X = train_data.drop(columns=[dep_col, date_col], axis=1)
test_data_X = test_data.drop(columns=[dep_col, date_col], axis=1)
# Fit the model on training data
reg = func.fit(train_data_X, train_data_y)
# Predict the test data
predictions = reg.predict(test_data_X)
# Looping through the predictions, adding the absolute error with respect to the horizon
# and appending the actual predictions to the dataframe
for counter, value in enumerate(predictions):
residuals[counter] += abs(value - test_data_y.iloc[counter])
result_frame = pd.concat([result_frame,
pd.DataFrame({'CUTOFF': [train_data[date_col].iloc[-1]],
'PREDICTED_PERIOD': [test_data[date_col].iloc[counter]],
'STEP_AHEAD': [counter + 1],
'ACTUAL': [test_data_y.iloc[counter]],
'PREDICTION': [value]})])
# Calculating the mean absolute error with respect to the horizon by dividing the summed
# absolute error of each horizon by the number of forecasts
residuals = residuals / len(data.iloc[initial::period, :].index)
print('Finished RollingCV with {} individual forecasts in {} seconds using method {}'.format(
len(data.iloc[initial::period, :].index), round(time.time() - start_time), func))
return result_frame, residuals
else:
print('Invalid cv_method')
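To illustrate the random cross validation branch in isolation, the sketch below reproduces its core loop on synthetic data, using a single lag feature and linear regression instead of the notebook's full feature set (all names here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic non-negative demand series standing in for the real data
rng = np.random.default_rng(1)
df = pd.DataFrame({'COUNT': rng.poisson(10, 200).astype(float)})
df['lag_1'] = df['COUNT'].shift(1)
df = df.dropna().reset_index(drop=True)

horizon, no_of_forecasts = 1, 10
# Draw unique random cutoffs from the second half of the data
np.random.seed(1)
cutoffs = np.random.permutation(np.arange(len(df) // 2, len(df) - horizon))[:no_of_forecasts]

abs_errors = []
for cut in cutoffs:
    cut = int(cut)
    # Fit on everything up to the cutoff, predict the next `horizon` points
    train, test = df.iloc[:cut], df.iloc[cut:cut + horizon]
    reg = LinearRegression().fit(train[['lag_1']], train['COUNT'])
    pred = reg.predict(test[['lag_1']])
    abs_errors.append(abs(pred[0] - test['COUNT'].iloc[0]))

print('MAE over {} random cutoffs: {:.2f}'.format(len(abs_errors), np.mean(abs_errors)))
```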
In the following a simple linear regression model will be fit. Linear regression can be suitable since it is easy to interpret and solves many machine learning problems in a time efficient manner.
# Splitting data into target and feature data
X = one_h_data.drop(columns=['DATE_FROM', 'COUNT'], axis=1)
y = one_h_data['COUNT']
# Initialising the linear regression model
lr_model = LinearRegression(n_jobs=-1)
# Fitting the model to the whole dataset
lr_model.fit(X, y);
# Calculating the fit of the model
lr_predictions = lr_model.predict(X)
# Calculating the residuals
lr_residuals = y - lr_predictions
# Plotting the fit of the model
pd.DataFrame({'ACTUALS': y, 'PREDICTIONS': lr_predictions}).plot(figsize=(20, 10));
The above graph plots the predictions of the regression model against the actual values. The model fit is already reasonably good. A problem, however, is that the model predicts negative values in times where bookings are low and misses low bookings in times of high demand. Also, very high demand patterns are not caught by the dynamics of the model.
pd.DataFrame({'Feature': X.columns, 'Coefficient': lr_model.coef_})
The above table shows the variables and the resulting coefficients. Again, the coefficients are as expected. It is important to mention, however, that the regression model found a negative influence of mean wind speed, which can be explained by the positive correlation of mean wind speed with air temperature: after accounting for the positive influence of air temperature, the influence of mean wind speed is negative, as one would expect it to be.
Due to the probably non-linear relationships that were observed during feature examination and the fact that linear regression predicted negative values, random forest regression seems to be an appropriate model for this analysis. It allows for non-linear relationships and avoids some key problems associated with linear regression (e.g. multicollinearity, strong assumptions about the underlying process). Moreover, it will only predict positive values as long as no negative values are present in the training data. In addition, this method has performed well on time series data in a whole range of cases.
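The claim that random forest regression stays non-negative while linear regression can extrapolate below zero can be checked with a small standalone sketch (toy data, not the notebook's features): a forest prediction is always an average of observed training targets, so it cannot leave their range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Toy data: a non-negative target that is flat, then rises linearly
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.maximum(0.0, X.ravel() - 10.0)

lr = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x_new = np.array([[-5.0]])             # a point below the training range
print(lr.predict(x_new)[0] < 0)        # linear regression extrapolates below zero
print(rf.predict(x_new)[0] >= 0)       # forest averages observed targets only
```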
# Initialising the random forest regression model
rf_model = RandomForestRegressor(n_jobs=-1, random_state=123)
# Fitting the model to the whole dataset
rf_model.fit(X, y);
# Calculating the fit of the model
rf_predictions = rf_model.predict(X)
# Calculating the residuals
rf_residuals = y - rf_predictions
# Creating fig and subplots
fig, axs = plt.subplots(2, 1, figsize=(17, 15), sharex=True)
# Plotting actuals and fit of linear and random forest regression
pd.DataFrame({'ACTUALS': y,
'RandomForestRegression': rf_predictions,
'LinearRegression': lr_predictions}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[0])
axs[0].set_title('Actuals vs. RandomForest vs. LinearRegression')
# Plotting residuals of linear and random forest regression
pd.DataFrame({'LinearRegression': lr_residuals,
'RandomForestRegression': rf_residuals}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[1])
axs[1].set_xlabel('Date')
axs[1].set_title('RandomForest Residuals vs. LinearRegression Residuals');
The above graphs compare the predictive power of linear and random forest regression. It can be seen that random forest regression fits the data better in two ways. Firstly the predictions of random forest regression do not contain negative values. Secondly the residuals are lower and their variance is consistent over time. These are strong indicators that the better model for prediction is random forest regression. This however has to be evaluated by cross-validation.
The below code was run in order to search for optimal hyperparameters. Deducing from the sklearn documentation, the hyperparameters used are the most important in terms of increasing prediction accuracy. The grid search was conducted using random cross-validation in order not to overfit the model. Since the process took about 35 hours, the best hyperparameter combination is provided as a dictionary below.
# Importing ParameterSampler for drawing random hyperparameter combinations
from sklearn.model_selection import ParameterSampler

def random_grid_cv(data, seed=1, param_distributions=None, n_iter=100, no_of_forecasts=52, horizon=100):
"""
Wrapper Function that makes use of TimeSeriesCV allowing randomized grid search selecting a random
sample of hyperparameter combinations.
:param data: a dataframe containing all features and data
:param seed: int defining the random state for cutoff and parameter sample selection
:param param_distributions: a dict containing the parameters and a list of values to try
:param n_iter: int specifying the number of different combinations to try
:param no_of_forecasts: int specifying how many cutoffs should be chosen
:param horizon: int specifying the number of periods to forecast ahead
:return: dict containing the hyperparameters and their specific mae
"""
start_time = time.time()
# Drawing a random sample of possible parameter combinations
param_list = list(ParameterSampler(param_distributions, n_iter=n_iter, random_state=seed))
# Initialising a list to store the arguments for each individual iteration
exec_list = list()
# Looping through the sample of parameter combinations and merge them with the function parameters
# creating a tuple object for each iteration
for param_variant in param_list:
exec_list.append((data, param_variant, True, no_of_forecasts, horizon, seed))
# Running TimeSeriesCV for each tuple in exec_list
result = list(starmap(TimeSeriesCV, exec_list))
print('Grid_search took: {} seconds'.format(round(time.time() - start_time)))
return result
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=1600, num=15)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 100, num=10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 4, 5, 6, 7, 8, 9, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
grid_results = random_grid_cv(data=one_h_data,
param_distributions=random_grid,
n_iter=215,
no_of_forecasts=52)
{5.330927838827839: {'n_estimators': 200,
'min_samples_split': 9,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 70,
'bootstrap': False}}
# Initialising a random forest regression model with optimized hyperparameters
optimized_rf_model = RandomForestRegressor(n_estimators=200,
min_samples_split=9,
min_samples_leaf=1,
max_features='sqrt',
max_depth=70,
bootstrap=False,
n_jobs=-1,
random_state=123)
# Fitting the model to the whole dataset
optimized_rf_model.fit(X,y);
# Calculating the fit of the model
optimized_rf_predictions = optimized_rf_model.predict(X)
# Calculating the residuals of the model
optimized_rf_residuals = y - optimized_rf_predictions
# Creating fig and subplots
fix, axs = plt.subplots(2, 1, figsize=(17, 15), sharex=True)
# Plotting actuals and fit of optimized and default random forest regression
pd.DataFrame({'ACTUALS':y,'OPTIMIZED':optimized_rf_predictions,
'DEFAULT':rf_predictions}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[0])
axs[0].set_title('Actuals vs. RandomForest Optimized vs. RandomForest Default')
# Plotting residuals of optimized and default random forest regression
pd.DataFrame({'DEFAULT':rf_residuals,
'OPTIMIZED':optimized_rf_residuals}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[1])
axs[1].set_xlabel('Date')
axs[1].set_title('RandomForest Optimized Residuals vs. RandomForest Default Residuals');
The above graphs compare the predictive power of the optimized and the default random forest regression. The optimized hyperparameters have increased the performance of the random forest regression even further in terms of fit. Whether they also increased predictive performance, however, has to be evaluated.
# Creating fig
fig = plt.figure(figsize=(20, 10))
# Deriving the feature importances of the optimized and default model
default_importances = list(rf_model.feature_importances_)
optimized_importances = list(optimized_rf_model.feature_importances_)
# Creating an array with range of number of variables
x_values = np.arange(len(default_importances))
# Creating bar plots of optimized and default importances
plt.bar(x_values, default_importances,orientation='vertical', color='blue', width=.5, align='center', label='Default')
plt.bar(x_values + 0.5, optimized_importances, orientation='vertical', color='red', width=.5, align='center', label='Optimized')
plt.xticks(x_values + 0.25, list(X.columns), rotation='vertical')
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
plt.legend(loc=1)
plt.show();
The graph shows that the optimized hyperparameters distribute the feature importance more evenly, resulting in a more generalized model. Looking at the feature importance it becomes clear that the first lag exerts the highest influence on the bike rental demand prediction, followed by lag two. Season, precipitation height and rain fallen could be dropped as features to achieve higher efficiency in terms of fitting and prediction times; this, however, is not necessary since the fitting times are appropriate.
The code below runs a rolling one-step-ahead cross validation, calculating the mean absolute error over a total of 119 individual forecasts.
The minimum number of training observations is 8759, which is the equivalent of a year of past data. Every 74 periods, which is equivalent to about three days, a one-step-ahead forecast is produced. The number of 74 periods has been chosen so that the predictions are carried out at different times of the day.
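Why stepping by 74 periods hits different times of the day can be seen from 74 mod 24 = 2: successive cutoffs shift by two hours, so the forecast origin cycles through all even hours (a tiny standalone check):

```python
# Hour of day of each forecast origin when stepping by 74 hours from hour 0
hours = [(i * 74) % 24 for i in range(12)]
print(hours)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
```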
# Setting the cv_method
cv_method='rolling'
# Setting the initial parameter
initial=8759
# Setting the period parameter
period=74
# setting the horizon parameter
horizon=1
# Performing a linear regression one-step-ahead prediction
lr_rolling_results, lr_rolling_residuals = TimeSeriesCV(one_h_data,
func=lr_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Performing a random forest regression with default hyperparameters one-step-ahead prediction
rf_rolling_results, rf_rolling_residuals = TimeSeriesCV(one_h_data,
func=rf_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Performing a random forest regression with optimized hyperparameters one-step-ahead prediction
optimized_rf_rolling_results, optimized_rf_rolling_residuals = TimeSeriesCV(one_h_data,
func=optimized_rf_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Creating a dataframe containing actuals and the predictions of each model
rf_comparison = pd.DataFrame({'PREDICTED_PERIOD': rf_rolling_results['PREDICTED_PERIOD'],
'ACTUAL': rf_rolling_results['ACTUAL'],
'RandomForestRegression DEFAULT': rf_rolling_results['PREDICTION'],
'RandomForestRegression OPTIMIZED': optimized_rf_rolling_results['PREDICTION'],
'LinearRegression': lr_rolling_results['PREDICTION']})
# Plotting actuals and predictions of each model
rf_comparison.set_index("PREDICTED_PERIOD").plot(figsize=(17,10));
# Printing the MAE of the models
print('Linear Regression: {0:.2f}\n'.format(lr_rolling_residuals[0]))
print('Random Forest Regression with default parameters: {0:.2f}\n'.format(rf_rolling_residuals[0]))
print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(optimized_rf_rolling_residuals[0]))
print('Improvement due to optimized Random Forest Regression: {0:.2f}%\n'.format((1 - (optimized_rf_rolling_residuals[0] / rf_rolling_residuals[0])) * 100))
# Printing the nMAE of the models
print('Linear Regression: {0:.2f}\n'.format(lr_rolling_residuals[0] / one_h_data['COUNT'].mean()))
print('Random Forest Regression with default parameters: {0:.2f}\n'.format(rf_rolling_residuals[0] / one_h_data['COUNT'].mean()))
print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(optimized_rf_rolling_residuals[0] / one_h_data['COUNT'].mean()))
The results of the rolling one-step-ahead cross validation suggest that random forest regression with optimized hyperparameters yields the best result, with a 5% increase in prediction performance compared to the default model. Optimizing the hyperparameters also decreased the time needed to fit the model, which can be seen as an increase in efficiency. The prediction performance of linear regression, however, is very close, and it shows a significant efficiency advantage in terms of time.
The following section deals with two-hourly data. Since the examination as well as the engineering process are equal to the hourly data, this part will be unannotated. The evaluation, however, will be conducted and discussed.
# Dropping columns that do not seem to have an effect
two_h_data.drop(columns=['SCHOOL_HOLIDAY', 'BANK_HOLIDAY'], axis=1, inplace=True)
# Generating columns that contain lagged values of the target variable
two_h_data['lag_1'] = two_h_data['COUNT'].shift(periods=1)
two_h_data['lag_2'] = two_h_data['COUNT'].shift(periods=2)
# Dropping missing values as a result of the shifts and reset the index
two_h_data = two_h_data.dropna(axis=0).reset_index()
# Creating a SEASON list based on the specific month
seasons = []
for month in two_h_data.DATE_FROM.dt.strftime('%m').astype('int'):
if month in [1, 2, 12]:
seasons.append('WINTER')
elif month in [3]:
seasons.append('SPRING')
elif month in [4, 5, 6, 7, 8, 9]:
seasons.append('SUMMER')
elif month in [10, 11]:
seasons.append('FALL')
# Merging the seasons as one-hot-encoded binary variables
two_h_data = two_h_data.merge(pd.get_dummies(seasons, drop_first=False),how='left', left_index=True, right_index=True)
# Creating an individual column for hour of the day
two_h_data['HOUR'] = two_h_data.DATE_FROM.dt.strftime('%-H').astype('int')
# Applying sine,cosine transformation on column hour to retain the cyclical nature
two_h_data['HOUR_SIN'] = np.sin(two_h_data.HOUR * (2. * np.pi / 24))
two_h_data['HOUR_COS'] = np.cos(two_h_data.HOUR * (2. * np.pi / 24))
# Creating an individual column for week of the year
two_h_data['WEEK_OF_YEAR'] = two_h_data.DATE_FROM.dt.strftime('%W').astype('int')
# Applying sine,cosine transformation on column WEEK_OF_YEAR to retain the cyclical nature
two_h_data['WEEK_OF_YEAR_SIN'] = np.sin(two_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
two_h_data['WEEK_OF_YEAR_COS'] = np.cos(two_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
# Creating an individual column for day of the week
two_h_data['DAY_OF_WEEK'] = two_h_data.DATE_FROM.dt.strftime('%w').astype('int')
# Applying sine,cosine transformation on column DAY_OF_WEEK to retain the cyclical nature
two_h_data['DAY_OF_WEEK_SIN'] = np.sin(two_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
two_h_data['DAY_OF_WEEK_COS'] = np.cos(two_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
# Dropping individual time columns since their transformation will be used
two_h_data.drop(columns=['WEEK_OF_YEAR', 'DAY_OF_WEEK', 'HOUR'], axis=1, inplace=True)
The code below runs a rolling one-step-ahead cross validation, calculating the mean absolute error over a total of 142 individual forecasts.
The minimum number of training observations is 4379, which is the equivalent of a year of past data. Every 31 periods, which is equivalent to about 2.6 days, a one-step-ahead forecast is produced. The number of 31 periods has been chosen so that the predictions are carried out at different times of the day.
# Setting the cv_method
cv_method='rolling'
# Setting the initial parameter
initial=4379
# Setting the period parameter
period=31
# setting the horizon parameter
horizon=1
# Performing a linear regression one-step-ahead prediction
two_h_lr_rolling_results, two_h_lr_rolling_residuals = TimeSeriesCV(two_h_data,
func=lr_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Performing a random forest regression with default hyperparameters one-step-ahead prediction
two_h_rf_rolling_results, two_h_rf_rolling_residuals = TimeSeriesCV(two_h_data,
func=rf_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Performing a random forest regression with optimized hyperparameters one-step-ahead prediction
two_h_optimized_rf_rolling_results, two_h_optimized_rf_residuals = TimeSeriesCV(two_h_data,
func=optimized_rf_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Creating a dataframe containing actuals and the predictions of each model
two_h_rf_comparison = pd.DataFrame({'PREDICTED_PERIOD': two_h_rf_rolling_results['PREDICTED_PERIOD'],
'ACTUAL': two_h_rf_rolling_results['ACTUAL'],
'RandomForestRegression DEFAULT': two_h_rf_rolling_results['PREDICTION'],
'RandomForestRegression OPTIMIZED': two_h_optimized_rf_rolling_results['PREDICTION'],
'LinearRegression': two_h_lr_rolling_results['PREDICTION']
})
# Plotting actuals and predictions of each model
two_h_rf_comparison.set_index('PREDICTED_PERIOD').plot(figsize=(17, 10));
# Printing the MAE of the models
print('Linear Regression: {0:.2f}\n'.format(two_h_lr_rolling_residuals[0]))
print('Random Forest Regression with default parameters: {0:.2f}\n'.format(two_h_rf_rolling_residuals[0]))
print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(two_h_optimized_rf_residuals[0]))
# Printing the nMAE of the models
print('Linear Regression: {0:.2f}\n'.format(two_h_rf_rolling_residuals[0] / two_h_data['COUNT'].mean()))
print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(two_h_lr_rolling_residuals[0] / two_h_data['COUNT'].mean()))
print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(two_h_optimized_rf_residuals[0] / two_h_data['COUNT'].mean()))
Again, random forest regression with optimized hyperparameters outperforms the other models, but once more only a small increase in predictive performance over linear regression is achieved.
# Generating a column that contains lagged values of the target variable
six_h_data['lag_4'] = six_h_data['COUNT'].shift(periods=4)
# Dropping rows with missing values introduced by the lag and resetting the index
six_h_data = six_h_data.dropna(axis=0).reset_index()
# Creating a SEASON list based on the specific month
seasons = []
for month in six_h_data.DATE_FROM.dt.strftime('%m').astype('int'):
if month in [1, 2, 12]:
seasons.append('WINTER')
elif month in [3]:
seasons.append('SPRING')
elif month in [4, 5, 6, 7, 8, 9]:
seasons.append('SUMMER')
elif month in [10, 11]:
seasons.append('FALL')
# Merging the seasons as one-hot-encoded binary variables
six_h_data = six_h_data.merge(pd.get_dummies(seasons, drop_first=False),how='left', left_index=True, right_index=True)
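The if/elif chain above can also be expressed as a dictionary lookup, which is a common and slightly more compact idiom for this kind of categorical mapping. A minimal stand-alone sketch (the months Series here is hypothetical example data):

```python
import pandas as pd

# Hypothetical month values as they would come from DATE_FROM
months = pd.Series([1, 3, 7, 10, 12])

# Month -> season lookup, equivalent to the if/elif chain above
season_of_month = {1: 'WINTER', 2: 'WINTER', 12: 'WINTER',
                   3: 'SPRING',
                   4: 'SUMMER', 5: 'SUMMER', 6: 'SUMMER',
                   7: 'SUMMER', 8: 'SUMMER', 9: 'SUMMER',
                   10: 'FALL', 11: 'FALL'}
seasons = months.map(season_of_month)
dummies = pd.get_dummies(seasons)  # one-hot encoding, as in the merge above
print(seasons.tolist())
print(dummies.columns.tolist())
```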
# Creating an individual column for week of the year
six_h_data['WEEK_OF_YEAR'] = six_h_data.DATE_FROM.dt.strftime('%W').astype('int')
# Applying sine,cosine transformation on column WEEK_OF_YEAR to retain the cyclical nature
six_h_data['WEEK_OF_YEAR_SIN'] = np.sin(six_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
six_h_data['WEEK_OF_YEAR_COS'] = np.cos(six_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
# Creating an individual column for day of the week
six_h_data['DAY_OF_WEEK'] = six_h_data.DATE_FROM.dt.strftime('%w').astype('int')
# Applying sine,cosine transformation on column DAY_OF_WEEK to retain the cyclical nature
six_h_data['DAY_OF_WEEK_SIN'] = np.sin(six_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
six_h_data['DAY_OF_WEEK_COS'] = np.cos(six_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
# Dropping individual time columns since their transformation will be used
six_h_data.drop(columns=['WEEK_OF_YEAR', 'DAY_OF_WEEK'], axis=1, inplace=True)
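The point of the sine/cosine transformation is that it preserves the cyclical adjacency that raw integers lose: Saturday (6) and Sunday (0) are 6 apart as integers but neighbours on the weekly cycle. A small stand-alone check of the encoding used above:

```python
import numpy as np

days = np.arange(7)
sin_enc = np.sin(days * (2. * np.pi / 7))
cos_enc = np.cos(days * (2. * np.pi / 7))

def dist(a, b):
    """Euclidean distance between two days in the (sin, cos) plane."""
    return np.hypot(sin_enc[a] - sin_enc[b], cos_enc[a] - cos_enc[b])

# Saturday and Sunday end up exactly as close as any two adjacent days,
# and closer than days that are further apart on the cycle
print(round(dist(6, 0), 3), round(dist(0, 1), 3), round(dist(0, 3), 3))
```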
# Setting the cv_method
cv_method='rolling'
# Setting the initial parameter
initial=1457
# Setting the period parameter
period=11
# setting the horizon parameter
horizon=1
# Performing a linear regression one-step-ahead prediction
six_h_lr_rolling_results, six_h_lr_rolling_residuals = TimeSeriesCV(six_h_data,
func=lr_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Performing a random forest regression with default hyperparameters one-step-ahead prediction
six_h_rf_rolling_results, six_h_rf_rolling_residuals = TimeSeriesCV(six_h_data,
func=rf_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Performing a random forest regression with optimized hyperparameters one-step-ahead prediction
six_h_optimized_rf_rolling_results, six_h_optimized_rf_residuals = TimeSeriesCV(six_h_data,
func=optimized_rf_model,
cv_method=cv_method,
initial=initial,
period=period,
horizon=horizon)
# Creating a dataframe containing actuals and the predictions of each model
six_h_rf_comparison = pd.DataFrame({'PREDICTED_PERIOD': six_h_rf_rolling_results['PREDICTED_PERIOD'],
'ACTUAL': six_h_rf_rolling_results['ACTUAL'],
'RandomForestRegression DEFAULT': six_h_rf_rolling_results['PREDICTION'],
'RandomForestRegression OPTIMIZED': six_h_optimized_rf_rolling_results['PREDICTION'],
'LinearRegression': six_h_lr_rolling_results['PREDICTION']
})
# Plotting actuals and predictions of each model
six_h_rf_comparison.set_index('PREDICTED_PERIOD').plot(figsize=(17, 10));
# Printing the MAE of the models
print('Linear Regression: {0:.2f}\n'.format(six_h_lr_rolling_residuals[0]))
print('Random Forest Regression with default parameters: {0:.2f}\n'.format(six_h_rf_rolling_residuals[0]))
print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(six_h_optimized_rf_residuals[0]))
# Printing the nMAE of the models
print('Linear Regression: {0:.2f}\n'.format(six_h_lr_rolling_residuals[0] / six_h_data['COUNT'].mean()))
print('Random Forest Regression with default parameters: {0:.2f}\n'.format(six_h_rf_rolling_residuals[0] / six_h_data['COUNT'].mean()))
print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(six_h_optimized_rf_residuals[0] / six_h_data['COUNT'].mean()))
The same patterns can be observed at the six-hour resolution as well.
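The normalized MAE (nMAE) reported above is simply the MAE divided by the mean of the target, which makes errors comparable across resolutions with very different average demand. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical actuals and predictions for four periods
actual = np.array([10., 14., 8., 12.])
predicted = np.array([11., 12., 9., 12.])

mae = np.mean(np.abs(actual - predicted))  # average absolute error
nmae = mae / actual.mean()                 # error relative to average demand
print('MAE: {0:.2f}, nMAE: {1:.2f}'.format(mae, nmae))
```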
This work has been a first attempt to predict bike-sharing demand in the city of Kassel. Different prediction approaches were chosen depending on the resolution and the process of the underlying data.
For daily data, time series algorithms were used, since most time series models require data at a daily resolution or coarser. The algorithms employed were SARIMAX and Facebook's Prophet. Both perform reasonably well in predicting daily bike rental demand, with an MAE of around 60 bikes per day. These algorithms could be improved further by utilizing more data, which would allow them to truly learn the seasonal structure of the process (e.g. yearly seasonality currently rests on only two realisations of each month, since the data covers two years). SARIMAX could benefit from a grid search over the order hyperparameters to further improve prediction results, and other hyperparameters of Prophet could also be examined. Generally, Prophet delivers better performance in terms of both prediction quality and computational resources. It is therefore the more efficient and effective choice and should be employed as the final model.
For sub-daily data, most time series algorithms are not appropriate; therefore linear regression and random forest regression were utilized for prediction. Linear regression was intended as a baseline model to beat. A random forest regression was trained, and a grid search for optimal hyperparameters was conducted, leading to a small improvement in fit. Contrary to expectations, linear regression performed close to the more complex random forest regression. However, the random forest explains the data better, which can be seen from its much better fit, and since its predictions are also better by around 5%, random forest regression should be chosen as the final model. Further improvements could be achieved by acquiring a calendar of special festive occasions in the city of Kassel, which was not available for the time frame of the data.
Finally, it can be said that prediction could be carried out in a relatively precise manner. However, to fully utilize these predictions, tools have to be developed that predict demand per station, since the aim is to increase service quality by reallocating unused bikes. This requires more complex models that account for the fact that bikes can only be rented when bikes are present; current usage patterns might therefore not reflect the actual demand.
Comparison of mean absolute error (MAE) of different algorithms at different resolutions
| Model | Daily | Hourly | Two-hourly | Six-hourly |
|---|---|---|---|---|
| SARIMAX | 63.69 | - | - | - |
| Prophet | 55.37 | - | - | - |
| Linear regression | - | 5.45 | 8.60 | 29.58 |
| Random forest regression | - | 6.43 | 11.76 | 32.47 |
| Random forest regression (optimized) | - | 5.17 | 8.23 | 27.64 |